* [PATCH 00/12] x86: more or less log-dirty related improvements
From: Jan Beulich @ 2021-06-25 13:15 UTC
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson

... or so I hope. This series continues the attempt to deal with
the OVMF change putting the shared info page at a very high address
(which is now planned to get reverted there, but the general
problem doesn't go away by their doing so). There are further issues
with truncated values, which are dealt with here. But there are
also changes which aren't directly related, made where I simply
spotted things that are unlikely to be right the way they are. And
then there are also adjustments to the underlying hypervisor
implementation, with the goal of making the returned data more
useful to the consumers.

With these changes in place, a 1GiB guest which has "inflated"
itself by putting a page right below the 16TiB boundary migrates
successfully, albeit the process takes anywhere from some 20 minutes
to over half an hour on my test system.

01: libxc: split xc_logdirty_control() from xc_shadow_control()
02: libxenguest: deal with log-dirty op stats overflow
03: libxenguest: short-circuit "all-dirty" handling
04: libxenguest: avoid allocating unused deferred-pages bitmap
05: libxenguest: complete loops in xc_map_domain_meminfo()
06: libxenguest: guard against overflow from too large p2m when checkpointing
07: libxenguest: fix off-by-1 in colo-secondary-bitmap merging
08: x86/paging: deal with log-dirty stats overflow
09: x86/paging: supply more useful log-dirty page count
10: x86/mm: update log-dirty bitmap when manipulating P2M
11: x86/mm: pull a sanity check earlier in xenmem_add_to_physmap_one()
12: SUPPORT.md: write down restriction of 32-bit tool stacks

Jan




* [PATCH 01/12] libxc: split xc_logdirty_control() from xc_shadow_control()
From: Jan Beulich @ 2021-06-25 13:17 UTC
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson, Marek Marczykowski,
	Christian Lindig, David Scott

For log-dirty operations a 64-bit field is being truncated to an
"int" return value. Given the large number of arguments the present
function takes, reduce its set of parameters to those needed by all
operations not involving the log-dirty bitmap, while introducing a
new wrapper for the log-dirty bitmap operations. This new function
in turn doesn't need an "mb" parameter, but has a 64-bit return
type. (Communicating the page count via the return value rather
than a pointer-type parameter is left as is, to disturb callers as
little as possible.)

While altering xc_shadow_control() anyway, also adjust the types of the
last two of the remaining parameters.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I wonder whether we shouldn't take the opportunity and also rename
xc_shadow_control() to, say, xc_paging_control(), matching the layer
above the HAP/shadow distinction in the hypervisor.
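
For illustration, a minimal sketch (not part of this patch; error
handling shortened) of what a log-dirty call site looks like after
the split:

    /* Illustrative only: the wrapper returns the (possibly > 2^31)
     * page count as long long instead of a truncated int. */
    long long ret = xc_logdirty_control(xch, domid,
                                        XEN_DOMCTL_SHADOW_OP_PEEK,
                                        HYPERCALL_BUFFER(dirty_bitmap),
                                        p2m_size, 0 /* mode */, &stats);
    if ( ret < 0 )
        PERROR("Failed to retrieve logdirty bitmap");

Allocation queries/changes, in turn, go through the slimmed-down
xc_shadow_control(xch, domid, sop, &mb, mode).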

--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -885,11 +885,15 @@ typedef struct xen_domctl_shadow_op_stat
 int xc_shadow_control(xc_interface *xch,
                       uint32_t domid,
                       unsigned int sop,
-                      xc_hypercall_buffer_t *dirty_bitmap,
-                      unsigned long pages,
-                      unsigned long *mb,
-                      uint32_t mode,
-                      xc_shadow_op_stats_t *stats);
+                      unsigned int *mb,
+                      unsigned int mode);
+long long xc_logdirty_control(xc_interface *xch,
+                              uint32_t domid,
+                              unsigned int sop,
+                              xc_hypercall_buffer_t *dirty_bitmap,
+                              unsigned long pages,
+                              unsigned int mode,
+                              xc_shadow_op_stats_t *stats);
 
 int xc_sched_credit_domain_set(xc_interface *xch,
                                uint32_t domid,
--- a/tools/libs/ctrl/xc_domain.c
+++ b/tools/libs/ctrl/xc_domain.c
@@ -650,25 +650,48 @@ int xc_watchdog(xc_interface *xch,
 int xc_shadow_control(xc_interface *xch,
                       uint32_t domid,
                       unsigned int sop,
-                      xc_hypercall_buffer_t *dirty_bitmap,
-                      unsigned long pages,
-                      unsigned long *mb,
-                      uint32_t mode,
-                      xc_shadow_op_stats_t *stats)
+                      unsigned int *mb,
+                      unsigned int mode)
 {
     int rc;
     DECLARE_DOMCTL;
-    DECLARE_HYPERCALL_BUFFER_ARGUMENT(dirty_bitmap);
 
     memset(&domctl, 0, sizeof(domctl));
 
     domctl.cmd = XEN_DOMCTL_shadow_op;
     domctl.domain = domid;
     domctl.u.shadow_op.op     = sop;
-    domctl.u.shadow_op.pages  = pages;
     domctl.u.shadow_op.mb     = mb ? *mb : 0;
     domctl.u.shadow_op.mode   = mode;
-    if (dirty_bitmap != NULL)
+
+    rc = do_domctl(xch, &domctl);
+
+    if ( mb )
+        *mb = domctl.u.shadow_op.mb;
+
+    return rc;
+}
+
+long long xc_logdirty_control(xc_interface *xch,
+                              uint32_t domid,
+                              unsigned int sop,
+                              xc_hypercall_buffer_t *dirty_bitmap,
+                              unsigned long pages,
+                              unsigned int mode,
+                              xc_shadow_op_stats_t *stats)
+{
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER_ARGUMENT(dirty_bitmap);
+
+    memset(&domctl, 0, sizeof(domctl));
+
+    domctl.cmd = XEN_DOMCTL_shadow_op;
+    domctl.domain = domid;
+    domctl.u.shadow_op.op    = sop;
+    domctl.u.shadow_op.pages = pages;
+    domctl.u.shadow_op.mode  = mode;
+    if ( dirty_bitmap )
         set_xen_guest_handle(domctl.u.shadow_op.dirty_bitmap,
                                 dirty_bitmap);
 
@@ -678,9 +701,6 @@ int xc_shadow_control(xc_interface *xch,
         memcpy(stats, &domctl.u.shadow_op.stats,
                sizeof(xc_shadow_op_stats_t));
     
-    if ( mb ) 
-        *mb = domctl.u.shadow_op.mb;
-
     return (rc == 0) ? domctl.u.shadow_op.pages : rc;
 }
 
--- a/tools/libs/guest/xg_sr_restore.c
+++ b/tools/libs/guest/xg_sr_restore.c
@@ -459,10 +459,10 @@ static int send_checkpoint_dirty_pfn_lis
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                     &ctx->restore.dirty_bitmap_hbuf);
 
-    if ( xc_shadow_control(
+    if ( xc_logdirty_control(
              xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
              HYPERCALL_BUFFER(dirty_bitmap), ctx->restore.p2m_size,
-             NULL, 0, &stats) != ctx->restore.p2m_size )
+             0, &stats) != ctx->restore.p2m_size )
     {
         PERROR("Failed to retrieve logdirty bitmap");
         goto err;
--- a/tools/libs/guest/xg_sr_save.c
+++ b/tools/libs/guest/xg_sr_save.c
@@ -428,18 +428,18 @@ static int enable_logdirty(struct xc_sr_
     /* This juggling is required if logdirty is enabled for VRAM tracking. */
     rc = xc_shadow_control(xch, ctx->domid,
                            XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
-                           NULL, 0, NULL, 0, NULL);
+                           NULL, 0);
     if ( rc < 0 )
     {
         on1 = errno;
         rc = xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
-                               NULL, 0, NULL, 0, NULL);
+                               NULL, 0);
         if ( rc < 0 )
             off = errno;
         else {
             rc = xc_shadow_control(xch, ctx->domid,
                                    XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
-                                   NULL, 0, NULL, 0, NULL);
+                                   NULL, 0);
             if ( rc < 0 )
                 on2 = errno;
         }
@@ -556,10 +556,10 @@ static int send_memory_live(struct xc_sr
         if ( policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
             break;
 
-        if ( xc_shadow_control(
+        if ( xc_logdirty_control(
                  xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
                  &ctx->save.dirty_bitmap_hbuf, ctx->save.p2m_size,
-                 NULL, 0, &stats) != ctx->save.p2m_size )
+                 0, &stats) != ctx->save.p2m_size )
         {
             PERROR("Failed to retrieve logdirty bitmap");
             rc = -1;
@@ -653,10 +653,10 @@ static int suspend_and_send_dirty(struct
     if ( rc )
         goto out;
 
-    if ( xc_shadow_control(
+    if ( xc_logdirty_control(
              xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
              HYPERCALL_BUFFER(dirty_bitmap), ctx->save.p2m_size,
-             NULL, XEN_DOMCTL_SHADOW_LOGDIRTY_FINAL, &stats) !=
+             XEN_DOMCTL_SHADOW_LOGDIRTY_FINAL, &stats) !=
          ctx->save.p2m_size )
     {
         PERROR("Failed to retrieve logdirty bitmap");
@@ -716,10 +716,10 @@ static int verify_frames(struct xc_sr_co
     if ( rc )
         goto out;
 
-    if ( xc_shadow_control(
+    if ( xc_logdirty_control(
              xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_PEEK,
              &ctx->save.dirty_bitmap_hbuf, ctx->save.p2m_size,
-             NULL, 0, &stats) != ctx->save.p2m_size )
+             0, &stats) != ctx->save.p2m_size )
     {
         PERROR("Failed to retrieve logdirty bitmap");
         rc = -1;
@@ -834,7 +834,7 @@ static void cleanup(struct xc_sr_context
 
 
     xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
-                      NULL, 0, NULL, 0, NULL);
+                      NULL, 0);
 
     if ( ctx->save.ops.cleanup(ctx) )
         PERROR("Failed to clean up");
--- a/tools/libs/light/libxl_colo_restore.c
+++ b/tools/libs/light/libxl_colo_restore.c
@@ -62,7 +62,7 @@ static void colo_enable_logdirty(libxl__
     /* we need to know which pages are dirty to restore the guest */
     if (xc_shadow_control(CTX->xch, domid,
                           XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
-                          NULL, 0, NULL, 0, NULL) < 0) {
+                          NULL, 0) < 0) {
         LOGD(ERROR, domid, "cannot enable secondary vm's logdirty");
         lds->callback(egc, lds, ERROR_FAIL);
         return;
@@ -90,7 +90,7 @@ static void colo_disable_logdirty(libxl_
 
     /* we need to know which pages are dirty to restore the guest */
     if (xc_shadow_control(CTX->xch, domid, XEN_DOMCTL_SHADOW_OP_OFF,
-                          NULL, 0, NULL, 0, NULL) < 0)
+                          NULL, 0) < 0)
         LOGD(WARN, domid, "cannot disable secondary vm's logdirty");
 
     if (crs->hvm) {
--- a/tools/libs/light/libxl_x86.c
+++ b/tools/libs/light/libxl_x86.c
@@ -529,10 +529,10 @@ int libxl__arch_domain_create(libxl__gc
         xc_domain_set_time_offset(ctx->xch, domid, rtc_timeoffset);
 
     if (d_config->b_info.type != LIBXL_DOMAIN_TYPE_PV) {
-        unsigned long shadow = DIV_ROUNDUP(d_config->b_info.shadow_memkb,
-                                           1024);
+        unsigned int shadow = DIV_ROUNDUP(d_config->b_info.shadow_memkb,
+                                          1024);
         xc_shadow_control(ctx->xch, domid, XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION,
-                          NULL, 0, &shadow, 0, NULL);
+                          &shadow, 0);
     }
 
     if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV &&
--- a/tools/ocaml/libs/xc/xenctrl_stubs.c
+++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
@@ -997,13 +997,13 @@ CAMLprim value stub_shadow_allocation_ge
 {
 	CAMLparam2(xch, domid);
 	CAMLlocal1(mb);
-	unsigned long c_mb;
+	unsigned int c_mb;
 	int ret;
 
 	caml_enter_blocking_section();
 	ret = xc_shadow_control(_H(xch), _D(domid),
 				XEN_DOMCTL_SHADOW_OP_GET_ALLOCATION,
-				NULL, 0, &c_mb, 0, NULL);
+				&c_mb, 0);
 	caml_leave_blocking_section();
 	if (ret != 0)
 		failwith_xc(_H(xch));
@@ -1016,14 +1016,14 @@ CAMLprim value stub_shadow_allocation_se
 					  value mb)
 {
 	CAMLparam3(xch, domid, mb);
-	unsigned long c_mb;
+	unsigned int c_mb;
 	int ret;
 
 	c_mb = Int_val(mb);
 	caml_enter_blocking_section();
 	ret = xc_shadow_control(_H(xch), _D(domid),
 				XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION,
-				NULL, 0, &c_mb, 0, NULL);
+				&c_mb, 0);
 	caml_leave_blocking_section();
 	if (ret != 0)
 		failwith_xc(_H(xch));
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -1192,8 +1192,7 @@ static PyObject *pyxc_shadow_control(PyO
                                       &dom, &op) )
         return NULL;
     
-    if ( xc_shadow_control(xc->xc_handle, dom, op, NULL, 0, NULL, 0, NULL) 
-         < 0 )
+    if ( xc_shadow_control(xc->xc_handle, dom, op, NULL, 0) < 0 )
         return pyxc_error_to_exception(xc->xc_handle);
     
     Py_INCREF(zero);
@@ -1208,7 +1207,7 @@ static PyObject *pyxc_shadow_mem_control
     int op;
     uint32_t dom;
     int mbarg = -1;
-    unsigned long mb;
+    unsigned int mb;
 
     static char *kwd_list[] = { "dom", "mb", NULL };
 
@@ -1223,7 +1222,7 @@ static PyObject *pyxc_shadow_mem_control
         mb = mbarg;
         op = XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION;
     }
-    if ( xc_shadow_control(xc->xc_handle, dom, op, NULL, 0, &mb, 0, NULL) < 0 )
+    if ( xc_shadow_control(xc->xc_handle, dom, op, &mb, 0) < 0 )
         return pyxc_error_to_exception(xc->xc_handle);
     
     mbarg = mb;




* [PATCH 02/12] libxenguest: deal with log-dirty op stats overflow
From: Jan Beulich @ 2021-06-25 13:18 UTC
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson

In send_memory_live() the precise value the dirty_count struct field
gets initialized to doesn't matter much (apart from the triggering of
the log message in send_dirty_pages(), see below), but it is important
that it not be zero on the first iteration (or else send_dirty_pages()
won't get called at all). Saturate the initializer value at the maximum
value the field can hold.

While there, also initialize struct precopy_stats' respective field to
a saner value: we don't really know how many dirty pages there are at
that point.

In suspend_and_send_dirty() and verify_frames() the local variables
don't need initializing at all, as they're only an output from the
hypercall which gets invoked first thing.

In send_checkpoint_dirty_pfn_list() the local variable can be dropped
altogether: It's optional to xc_logdirty_control() and not used anywhere
else.

Note that in case the clipping actually takes effect, the "Bitmap
contained more entries than expected..." log message will trigger. This
being just an informational message, I don't think this is overly
concerning.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
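
As a self-contained illustration of the saturating initialization used
below (assuming a 32-bit dirty_count field, as in xc_shadow_op_stats_t,
and a 64-bit unsigned long):

    #include <stdint.h>
    #include <stdio.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int main(void)
    {
        unsigned long p2m_size = 1UL << 44; /* exceeds UINT32_MAX */
        /* (typeof(field))~0 is the all-ones, i.e. maximum, value of
         * the field's unsigned type - saturate rather than truncate. */
        uint32_t dirty_count = MIN(p2m_size, (uint32_t)~0);

        printf("%u\n", dirty_count); /* 4294967295, not 0 */
        return 0;
    }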

--- a/tools/libs/guest/xg_sr_restore.c
+++ b/tools/libs/guest/xg_sr_restore.c
@@ -452,7 +452,6 @@ static int send_checkpoint_dirty_pfn_lis
     unsigned int count, written;
     uint64_t i, *pfns = NULL;
     struct iovec *iov = NULL;
-    xc_shadow_op_stats_t stats = { 0, ctx->restore.p2m_size };
     struct xc_sr_record rec = {
         .type = REC_TYPE_CHECKPOINT_DIRTY_PFN_LIST,
     };
@@ -462,7 +461,7 @@ static int send_checkpoint_dirty_pfn_lis
     if ( xc_logdirty_control(
              xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
              HYPERCALL_BUFFER(dirty_bitmap), ctx->restore.p2m_size,
-             0, &stats) != ctx->restore.p2m_size )
+             0, NULL) != ctx->restore.p2m_size )
     {
         PERROR("Failed to retrieve logdirty bitmap");
         goto err;
--- a/tools/libs/guest/xg_sr_save.c
+++ b/tools/libs/guest/xg_sr_save.c
@@ -500,7 +500,9 @@ static int simple_precopy_policy(struct
 static int send_memory_live(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
-    xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
+    xc_shadow_op_stats_t stats = {
+        .dirty_count = MIN(ctx->save.p2m_size, (typeof(stats.dirty_count))~0)
+    };
     char *progress_str = NULL;
     unsigned int x = 0;
     int rc;
@@ -519,7 +521,7 @@ static int send_memory_live(struct xc_sr
         goto out;
 
     ctx->save.stats = (struct precopy_stats){
-        .dirty_count = ctx->save.p2m_size,
+        .dirty_count = -1,
     };
     policy_stats = &ctx->save.stats;
 
@@ -643,7 +645,7 @@ static int colo_merge_secondary_dirty_bi
 static int suspend_and_send_dirty(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
-    xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
+    xc_shadow_op_stats_t stats;
     char *progress_str = NULL;
     int rc;
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
@@ -701,7 +703,7 @@ static int suspend_and_send_dirty(struct
 static int verify_frames(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
-    xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
+    xc_shadow_op_stats_t stats;
     int rc;
     struct xc_sr_record rec = { .type = REC_TYPE_VERIFY };
 




* [PATCH 03/12] libxenguest: short-circuit "all-dirty" handling
From: Jan Beulich @ 2021-06-25 13:18 UTC
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson

For one, it is unnecessary to fill a perhaps large chunk of memory with
all ones. Add a new parameter to send_dirty_pages() allowing callers to
indicate as much.

Then it is further unnecessary to allocate the dirty bitmap altogether
when all that's ever going to happen is a single all-dirty run.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
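
Two notes on the mechanics, as a hedged sketch rather than anything
normative: in the all-dirty case the per-page bitmap test simply
becomes unconditional,

    /* Illustrative only: short-circuit the bitmap lookup. */
    for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p )
    {
        if ( !all_dirty && !test_bit(p, dirty_bitmap) )
            continue;
        /* ... batch and send page p ... */
    }

and when the bitmap is never consulted, setup() stores the non-NULL
poison value (void *)-1L merely so the common allocation-failure check
doesn't mistake "intentionally not allocated" for an allocation
failure; any actual dereference would fault noisily.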

--- a/tools/libs/guest/xg_sr_save.c
+++ b/tools/libs/guest/xg_sr_save.c
@@ -368,7 +368,7 @@ static int suspend_domain(struct xc_sr_c
  * Bitmap is bounded by p2m_size.
  */
 static int send_dirty_pages(struct xc_sr_context *ctx,
-                            unsigned long entries)
+                            unsigned long entries, bool all_dirty)
 {
     xc_interface *xch = ctx->xch;
     xen_pfn_t p;
@@ -379,7 +379,7 @@ static int send_dirty_pages(struct xc_sr
 
     for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p )
     {
-        if ( !test_bit(p, dirty_bitmap) )
+        if ( !all_dirty && !test_bit(p, dirty_bitmap) )
             continue;
 
         rc = add_to_batch(ctx, p);
@@ -411,12 +411,7 @@ static int send_dirty_pages(struct xc_sr
  */
 static int send_all_pages(struct xc_sr_context *ctx)
 {
-    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
-                                    &ctx->save.dirty_bitmap_hbuf);
-
-    bitmap_set(dirty_bitmap, ctx->save.p2m_size);
-
-    return send_dirty_pages(ctx, ctx->save.p2m_size);
+    return send_dirty_pages(ctx, ctx->save.p2m_size, true /* all_dirty */);
 }
 
 static int enable_logdirty(struct xc_sr_context *ctx)
@@ -508,9 +503,6 @@ static int send_memory_live(struct xc_sr
     int rc;
     int policy_decision;
 
-    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
-                                    &ctx->save.dirty_bitmap_hbuf);
-
     precopy_policy_t precopy_policy = ctx->save.callbacks->precopy_policy;
     void *data = ctx->save.callbacks->data;
 
@@ -528,8 +520,6 @@ static int send_memory_live(struct xc_sr
     if ( precopy_policy == NULL )
         precopy_policy = simple_precopy_policy;
 
-    bitmap_set(dirty_bitmap, ctx->save.p2m_size);
-
     for ( ; ; )
     {
         policy_decision = precopy_policy(*policy_stats, data);
@@ -541,7 +531,7 @@ static int send_memory_live(struct xc_sr
             if ( rc )
                 goto out;
 
-            rc = send_dirty_pages(ctx, stats.dirty_count);
+            rc = send_dirty_pages(ctx, stats.dirty_count, x == 1);
             if ( rc )
                 goto out;
         }
@@ -687,7 +677,8 @@ static int suspend_and_send_dirty(struct
         }
     }
 
-    rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages);
+    rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages,
+                          false /* all_dirty */);
     if ( rc )
         goto out;
 
@@ -807,8 +798,11 @@ static int setup(struct xc_sr_context *c
     if ( rc )
         goto err;
 
-    dirty_bitmap = xc_hypercall_buffer_alloc_pages(
-        xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)));
+    dirty_bitmap = ctx->save.live || ctx->stream_type != XC_STREAM_PLAIN
+        ? xc_hypercall_buffer_alloc_pages(
+              xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)))
+        : (void *)-1L;
+
     ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE *
                                   sizeof(*ctx->save.batch_pfns));
     ctx->save.deferred_pages = bitmap_alloc(ctx->save.p2m_size);




* [PATCH 04/12] libxenguest: avoid allocating unused deferred-pages bitmap
From: Jan Beulich @ 2021-06-25 13:19 UTC
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson

As with the dirty bitmap, it is unnecessary to allocate the deferred-
pages bitmap when all that's ever going to happen is a single all-dirty
run.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
The clearing of the bitmap at the end of suspend_and_send_dirty() also
looks unnecessary - am I overlooking anything?
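
The resulting pattern - guard every use of the now-optional bitmap -
amounts to a sketch like the following (patch 09 later folds this into
an update_deferred_pages() helper):

    /* Illustrative only: deferred_pages may legitimately be NULL for
     * a plain (non-live, non-checkpointed) save. */
    if ( ctx->save.deferred_pages )
    {
        set_bit(pfn, ctx->save.deferred_pages);
        ++ctx->save.nr_deferred_pages;
    }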

--- a/tools/libs/guest/xg_sr_save.c
+++ b/tools/libs/guest/xg_sr_save.c
@@ -130,7 +130,7 @@ static int write_batch(struct xc_sr_cont
                                                       ctx->save.batch_pfns[i]);
 
         /* Likely a ballooned page. */
-        if ( mfns[i] == INVALID_MFN )
+        if ( mfns[i] == INVALID_MFN && ctx->save.deferred_pages )
         {
             set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
             ++ctx->save.nr_deferred_pages;
@@ -196,8 +196,12 @@ static int write_batch(struct xc_sr_cont
             {
                 if ( rc == -1 && errno == EAGAIN )
                 {
-                    set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
-                    ++ctx->save.nr_deferred_pages;
+                    if ( ctx->save.deferred_pages )
+                    {
+                        set_bit(ctx->save.batch_pfns[i],
+                                ctx->save.deferred_pages);
+                        ++ctx->save.nr_deferred_pages;
+                    }
                     types[i] = XEN_DOMCTL_PFINFO_XTAB;
                     --nr_pages;
                 }
@@ -665,7 +669,8 @@ static int suspend_and_send_dirty(struct
     else
         xc_set_progress_prefix(xch, "Checkpointed save");
 
-    bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_size);
+    if ( ctx->save.deferred_pages )
+        bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_size);
 
     if ( !ctx->save.live && ctx->stream_type == XC_STREAM_COLO )
     {
@@ -682,7 +687,8 @@ static int suspend_and_send_dirty(struct
     if ( rc )
         goto out;
 
-    bitmap_clear(ctx->save.deferred_pages, ctx->save.p2m_size);
+    if ( ctx->save.deferred_pages )
+        bitmap_clear(ctx->save.deferred_pages, ctx->save.p2m_size);
     ctx->save.nr_deferred_pages = 0;
 
  out:
@@ -791,24 +797,31 @@ static int setup(struct xc_sr_context *c
 {
     xc_interface *xch = ctx->xch;
     int rc;
-    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
-                                    &ctx->save.dirty_bitmap_hbuf);
 
     rc = ctx->save.ops.setup(ctx);
     if ( rc )
         goto err;
 
-    dirty_bitmap = ctx->save.live || ctx->stream_type != XC_STREAM_PLAIN
-        ? xc_hypercall_buffer_alloc_pages(
-              xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)))
-        : (void *)-1L;
+    if ( ctx->save.live || ctx->stream_type != XC_STREAM_PLAIN )
+    {
+        DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                        &ctx->save.dirty_bitmap_hbuf);
+
+        dirty_bitmap =
+            xc_hypercall_buffer_alloc_pages(
+                xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)));
+        ctx->save.deferred_pages = bitmap_alloc(ctx->save.p2m_size);
+
+        if ( !dirty_bitmap || !ctx->save.deferred_pages )
+            goto enomem;
+    }
 
     ctx->save.batch_pfns = malloc(MAX_BATCH_SIZE *
                                   sizeof(*ctx->save.batch_pfns));
-    ctx->save.deferred_pages = bitmap_alloc(ctx->save.p2m_size);
 
-    if ( !ctx->save.batch_pfns || !dirty_bitmap || !ctx->save.deferred_pages )
+    if ( !ctx->save.batch_pfns )
     {
+    enomem:
         ERROR("Unable to allocate memory for dirty bitmaps, batch pfns and"
               " deferred pages");
         rc = -1;




* [PATCH 05/12] libxenguest: complete loops in xc_map_domain_meminfo()
From: Jan Beulich @ 2021-06-25 13:19 UTC
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson

minfo->p2m_size may have more than 31 significant bits. Change the
induction variable to unsigned long, and (largely for signedness
consistency) a helper variable to unsigned int.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
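
To illustrate what goes wrong without the change (a sketch, not part
of the patch): with an "int" induction variable, "i += 1024" overflows
(undefined behaviour) once p2m_size exceeds INT_MAX, and the
"minfo->p2m_size - i" expressions mix signedness. The corrected loop
shape:

    unsigned long i;

    for ( i = 0; i < minfo->p2m_size; i += 1024 )
    {
        unsigned int count = (minfo->p2m_size - i) > 1024
                             ? 1024 : (minfo->p2m_size - i);
        /* ... fetch this batch of PFN types ... */
    }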

--- a/tools/libs/guest/xg_domain.c
+++ b/tools/libs/guest/xg_domain.c
@@ -40,7 +40,7 @@ int xc_map_domain_meminfo(xc_interface *
     xc_dominfo_t info;
     shared_info_any_t *live_shinfo;
     xen_capabilities_info_t xen_caps = "";
-    int i;
+    unsigned long i;
 
     /* Only be initialized once */
     if ( minfo->pfn_type || minfo->p2m_table )
@@ -116,12 +116,12 @@ int xc_map_domain_meminfo(xc_interface *
     /* Retrieve PFN types in batches */
     for ( i = 0; i < minfo->p2m_size ; i+=1024 )
     {
-        int count = ((minfo->p2m_size - i ) > 1024 ) ?
-                        1024: (minfo->p2m_size - i);
+        unsigned int count = ((minfo->p2m_size - i) > 1024) ?
+                             1024 : (minfo->p2m_size - i);
 
         if ( xc_get_pfn_type_batch(xch, domid, count, minfo->pfn_type + i) )
         {
-            PERROR("Could not get %d-eth batch of PFN types", (i+1)/1024);
+            PERROR("Could not get batch %lu of PFN types", (i + 1) / 1024);
             goto failed;
         }
     }




* [PATCH 06/12] libxenguest: guard against overflow from too large p2m when checkpointing
From: Jan Beulich @ 2021-06-25 13:20 UTC
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson

struct xc_sr_record's length field has just 32 bits. Fill it early and
check that the calculated value hasn't overflowed. Additionally check
for counter overflow early - there's no point even trying to allocate
any memory in such an event.

While there also limit an induction variable's type to unsigned long:
There's no gain from it being uint64_t.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Of course looping over test_bit() is pretty inefficient, but given that
I have no idea how to test this code I wanted to restrict changes to
what can sensibly be seen as no worse than before from just looking at
the changes.
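
The early counter check relies on unsigned wrap-around; a
self-contained sketch of the idiom (assuming only that "count" is of
an unsigned type, with test_bit() as in the libxenguest sources):

    /* Illustrative only: if ++count wraps to 0, more bits are set
     * than the counter can represent - bail instead of continuing. */
    unsigned int count = 0;
    unsigned long i;

    for ( i = 0; i < nbits; i++ )
        if ( test_bit(i, bitmap) && !++count )
            break;

    if ( i < nbits ) /* loop exited early: count overflowed */
        return -1;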

--- a/tools/libs/guest/xg_sr_restore.c
+++ b/tools/libs/guest/xg_sr_restore.c
@@ -450,7 +450,8 @@ static int send_checkpoint_dirty_pfn_lis
     xc_interface *xch = ctx->xch;
     int rc = -1;
     unsigned int count, written;
-    uint64_t i, *pfns = NULL;
+    unsigned long i;
+    uint64_t *pfns = NULL;
     struct iovec *iov = NULL;
     struct xc_sr_record rec = {
         .type = REC_TYPE_CHECKPOINT_DIRTY_PFN_LIST,
@@ -469,16 +470,28 @@ static int send_checkpoint_dirty_pfn_lis
 
     for ( i = 0, count = 0; i < ctx->restore.p2m_size; i++ )
     {
-        if ( test_bit(i, dirty_bitmap) )
-            count++;
+        if ( test_bit(i, dirty_bitmap) && !++count )
+            break;
     }
 
+    if ( i < ctx->restore.p2m_size )
+    {
+        ERROR("Too many dirty pfns");
+        goto err;
+    }
+
+    rec.length = count * sizeof(*pfns);
+    if ( rec.length / sizeof(*pfns) != count )
+    {
+        ERROR("Too many (%u) dirty pfns", count);
+        goto err;
+    }
 
-    pfns = malloc(count * sizeof(*pfns));
+    pfns = malloc(rec.length);
     if ( !pfns )
     {
-        ERROR("Unable to allocate %zu bytes of memory for dirty pfn list",
-              count * sizeof(*pfns));
+        ERROR("Unable to allocate %u bytes of memory for dirty pfn list",
+              rec.length);
         goto err;
     }
 
@@ -504,8 +517,6 @@ static int send_checkpoint_dirty_pfn_lis
         goto err;
     }
 
-    rec.length = count * sizeof(*pfns);
-
     iov[0].iov_base = &rec.type;
     iov[0].iov_len = sizeof(rec.type);
 
@@ -513,7 +524,7 @@ static int send_checkpoint_dirty_pfn_lis
     iov[1].iov_len = sizeof(rec.length);
 
     iov[2].iov_base = pfns;
-    iov[2].iov_len = count * sizeof(*pfns);
+    iov[2].iov_len = rec.length;
 
     if ( writev_exact(ctx->restore.send_back_fd, iov, 3) )
     {




* [PATCH 07/12] libxenguest: fix off-by-1 in colo-secondary-bitmap merging
From: Jan Beulich @ 2021-06-25 13:20 UTC
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson

Valid GFNs (having a representation in the dirty bitmap) need to be
strictly below p2m_size.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
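
A sketch of the reasoning (illustrative only): a bitmap sized for
p2m_size pages has valid bit indexes 0 through p2m_size - 1, so
pfn == p2m_size is equally out of range and must be rejected:

    if ( pfn >= ctx->save.p2m_size ) /* was: pfn > ctx->save.p2m_size */
        /* reject */;

The previous ">" let the one-past-the-end PFN through to the bitmap
update.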

--- a/tools/libs/guest/xg_sr_save.c
+++ b/tools/libs/guest/xg_sr_save.c
@@ -614,7 +614,7 @@ static int colo_merge_secondary_dirty_bi
     for ( i = 0; i < count; i++ )
     {
         pfn = pfns[i];
-        if ( pfn > ctx->save.p2m_size )
+        if ( pfn >= ctx->save.p2m_size )
         {
             PERROR("Invalid pfn 0x%" PRIx64, pfn);
             rc = -1;




* [PATCH 08/12] x86/paging: deal with log-dirty stats overflow
From: Jan Beulich @ 2021-06-25 13:21 UTC
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap

While the precise values are unlikely to be of interest once they
exceed 4 billion (allowing us to leave the domctl struct alone), we
still shouldn't wrap or truncate the actual values. It would in
particular be problematic if a truncated value were zero (causing
libxenguest to skip an iteration altogether) or very small (leading to
premature exiting of the pre-copy phase).

Change the internal fields to unsigned long, and suitably saturate for
copying to guest context.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
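
The "UINT32_MAX + 0UL" used below exists solely to satisfy Xen's
type-checking min() macro; a hedged sketch of the saturation (the
domctl stats fields being 32 bits wide):

    /* Illustrative only: promote the 32-bit constant to unsigned long
     * so min()'s operands have matching types, then saturate instead
     * of truncating. */
    sc->stats.fault_count = min(d->arch.paging.log_dirty.fault_count,
                                UINT32_MAX + 0UL);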

--- a/xen/arch/x86/mm/paging.c
+++ b/xen/arch/x86/mm/paging.c
@@ -446,14 +446,16 @@ static int paging_log_dirty_op(struct do
 
     clean = (sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN);
 
-    PAGING_DEBUG(LOGDIRTY, "log-dirty %s: dom %u faults=%u dirty=%u\n",
+    PAGING_DEBUG(LOGDIRTY, "log-dirty %s: dom %u faults=%lu dirty=%lu\n",
                  (clean) ? "clean" : "peek",
                  d->domain_id,
                  d->arch.paging.log_dirty.fault_count,
                  d->arch.paging.log_dirty.dirty_count);
 
-    sc->stats.fault_count = d->arch.paging.log_dirty.fault_count;
-    sc->stats.dirty_count = d->arch.paging.log_dirty.dirty_count;
+    sc->stats.fault_count = min(d->arch.paging.log_dirty.fault_count,
+                                UINT32_MAX + 0UL);
+    sc->stats.dirty_count = min(d->arch.paging.log_dirty.dirty_count,
+                                UINT32_MAX + 0UL);
 
     if ( guest_handle_is_null(sc->dirty_bitmap) )
         /* caller may have wanted just to clean the state or access stats. */
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -190,8 +190,8 @@ struct log_dirty_domain {
     unsigned int   failed_allocs;
 
     /* log-dirty mode stats */
-    unsigned int   fault_count;
-    unsigned int   dirty_count;
+    unsigned long  fault_count;
+    unsigned long  dirty_count;
 
     /* functions which are paging mode specific */
     const struct log_dirty_ops {




* [PATCH 09/12] x86/paging: supply more useful log-dirty page count
From: Jan Beulich @ 2021-06-25 13:21 UTC
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson

In paging_log_dirty_op(), always update the count of pages field:
- if more pages were specified than the guest has ever accessed (HVM) or
  marked part of the p2m (PV), there's no point for the caller to
  inspect bits beyond the one for that last page,
- if the guest's p2m size has grown in the meantime, the caller would
  have no indication that it may not have caught all dirty bits.

Also exit the loop once having passed the last valid GFN. To balance
overhead and savings, do this before inspecting a new L2 table.

Adjust libxenguest accordingly, albeit these changes are necessary only
for guests which actually alter their P2M size while under migration.
They do, however, additionally open up the option of the hypervisor
eventually zapping large ranges of trailing zeros from the bitmap when
providing it back to the tools.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Of course this still is far from ideal: At the very least a perhaps
large tail of zeros could very well also result in a reduced page
count.
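
To help follow the "extra" computations in the hypervisor hunks below,
the per-level coverage of the log-dirty trie is (a sketch assuming 4k
pages and LOGDIRTY_NODE_ENTRIES == 512, as on x86):

    /* Illustrative only: bits (i.e. guest pages) covered per level. */
    #define BITS_PER_L1 (PAGE_SIZE * 8) /* 32768 */
    #define BITS_PER_L2 (LOGDIRTY_NODE_ENTRIES * BITS_PER_L1)
    #define BITS_PER_L3 (LOGDIRTY_NODE_ENTRIES * BITS_PER_L2)

    /* First bit beyond leaf (i4, i3, i2):
     * ((i4 * 512 + i3) * 512 + i2 + 1) * BITS_PER_L1 */

Correspondingly, on the tools side the bitmap allocation is rounded up
to whole pages, hence p2m_alloc_size = pages << (PAGE_SHIFT + 3) bits.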

--- a/tools/libs/guest/xg_sr_common.h
+++ b/tools/libs/guest/xg_sr_common.h
@@ -237,7 +237,16 @@ struct xc_sr_context
             /* Further debugging information in the stream. */
             bool debug;
 
+            /*
+             * Counts of bits (each representing a guest page), expressing
+             * respectively
+             * - obtained P2M size,
+             * - allocated bitmap size,
+             * - range actually filled with valid data.
+             */
             unsigned long p2m_size;
+            unsigned long p2m_alloc_size;
+            unsigned long p2m_used_size;
 
             struct precopy_stats stats;
 
@@ -245,6 +254,7 @@ struct xc_sr_context
             unsigned int nr_batch_pfns;
             unsigned long *deferred_pages;
             unsigned long nr_deferred_pages;
+            unsigned long used_deferred_pages;
             xc_hypercall_buffer_t dirty_bitmap_hbuf;
         } save;
 
--- a/tools/libs/guest/xg_sr_restore.c
+++ b/tools/libs/guest/xg_sr_restore.c
@@ -450,7 +450,8 @@ static int send_checkpoint_dirty_pfn_lis
     xc_interface *xch = ctx->xch;
     int rc = -1;
     unsigned int count, written;
-    unsigned long i;
+    unsigned long i, p2m_size;
+    long long ret;
     uint64_t *pfns = NULL;
     struct iovec *iov = NULL;
     struct xc_sr_record rec = {
@@ -459,22 +460,29 @@ static int send_checkpoint_dirty_pfn_lis
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                     &ctx->restore.dirty_bitmap_hbuf);
 
-    if ( xc_logdirty_control(
-             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
-             HYPERCALL_BUFFER(dirty_bitmap), ctx->restore.p2m_size,
-             0, NULL) != ctx->restore.p2m_size )
+    ret = xc_logdirty_control(
+              xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+              HYPERCALL_BUFFER(dirty_bitmap), ctx->restore.p2m_size,
+              0, NULL);
+    if ( ret < 0 )
     {
         PERROR("Failed to retrieve logdirty bitmap");
         goto err;
     }
+    if ( ret > ctx->restore.p2m_size )
+    {
+        ERROR("Guest has grown its p2m too much");
+        goto err;
+    }
+    p2m_size = ret;
 
-    for ( i = 0, count = 0; i < ctx->restore.p2m_size; i++ )
+    for ( i = 0, count = 0; i < p2m_size; i++ )
     {
         if ( test_bit(i, dirty_bitmap) && !++count )
             break;
     }
 
-    if ( i < ctx->restore.p2m_size )
+    if ( i < p2m_size )
     {
         ERROR("Too many dirty pfns");
         goto err;
@@ -495,7 +503,7 @@ static int send_checkpoint_dirty_pfn_lis
         goto err;
     }
 
-    for ( i = 0, written = 0; i < ctx->restore.p2m_size; ++i )
+    for ( i = 0, written = 0; i < p2m_size; ++i )
     {
         if ( !test_bit(i, dirty_bitmap) )
             continue;
@@ -739,8 +747,10 @@ static int setup(struct xc_sr_context *c
 
     if ( ctx->stream_type == XC_STREAM_COLO )
     {
+        unsigned long pages = NRPAGES(bitmap_size(ctx->restore.p2m_size));
+
         dirty_bitmap = xc_hypercall_buffer_alloc_pages(
-            xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->restore.p2m_size)));
+            xch, dirty_bitmap, pages);
 
         if ( !dirty_bitmap )
         {
@@ -748,6 +758,8 @@ static int setup(struct xc_sr_context *c
             rc = -1;
             goto err;
         }
+
+        ctx->restore.p2m_size = pages << (PAGE_SHIFT + 3);
     }
 
     rc = ctx->restore.ops.setup(ctx);
--- a/tools/libs/guest/xg_sr_save.c
+++ b/tools/libs/guest/xg_sr_save.c
@@ -74,6 +74,16 @@ static int write_checkpoint_record(struc
     return write_record(ctx, &checkpoint);
 }
 
+static void update_deferred_pages(struct xc_sr_context *ctx, xen_pfn_t pfn)
+{
+    if ( !ctx->save.deferred_pages )
+        return;
+    set_bit(pfn, ctx->save.deferred_pages);
+    ++ctx->save.nr_deferred_pages;
+    if ( pfn >= ctx->save.used_deferred_pages )
+        ctx->save.used_deferred_pages = pfn + 1;
+}
+
 /*
  * Writes a batch of memory as a PAGE_DATA record into the stream.  The batch
  * is constructed in ctx->save.batch_pfns.
@@ -130,11 +140,8 @@ static int write_batch(struct xc_sr_cont
                                                       ctx->save.batch_pfns[i]);
 
         /* Likely a ballooned page. */
-        if ( mfns[i] == INVALID_MFN && ctx->save.deferred_pages )
-        {
-            set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
-            ++ctx->save.nr_deferred_pages;
-        }
+        if ( mfns[i] == INVALID_MFN )
+            update_deferred_pages(ctx, ctx->save.batch_pfns[i]);
     }
 
     rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
@@ -196,12 +203,7 @@ static int write_batch(struct xc_sr_cont
             {
                 if ( rc == -1 && errno == EAGAIN )
                 {
-                    if ( ctx->save.deferred_pages )
-                    {
-                        set_bit(ctx->save.batch_pfns[i],
-                                ctx->save.deferred_pages);
-                        ++ctx->save.nr_deferred_pages;
-                    }
+                    update_deferred_pages(ctx, ctx->save.batch_pfns[i]);
                     types[i] = XEN_DOMCTL_PFINFO_XTAB;
                     --nr_pages;
                 }
@@ -369,7 +371,7 @@ static int suspend_domain(struct xc_sr_c
  * Send a subset of pages in the guests p2m, according to the dirty bitmap.
  * Used for each subsequent iteration of the live migration loop.
  *
- * Bitmap is bounded by p2m_size.
+ * Bitmap is bounded by p2m_alloc_size, but populated only up to p2m_used_size.
  */
 static int send_dirty_pages(struct xc_sr_context *ctx,
                             unsigned long entries, bool all_dirty)
@@ -381,7 +383,10 @@ static int send_dirty_pages(struct xc_sr
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                     &ctx->save.dirty_bitmap_hbuf);
 
-    for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p )
+    if ( all_dirty )
+        ctx->save.p2m_used_size = ctx->save.p2m_size;
+
+    for ( p = 0, written = 0; p < ctx->save.p2m_used_size; ++p )
     {
         if ( !all_dirty && !test_bit(p, dirty_bitmap) )
             continue;
@@ -526,6 +531,8 @@ static int send_memory_live(struct xc_sr
 
     for ( ; ; )
     {
+        long long ret;
+
         policy_decision = precopy_policy(*policy_stats, data);
         x++;
 
@@ -552,15 +559,23 @@ static int send_memory_live(struct xc_sr
         if ( policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
             break;
 
-        if ( xc_logdirty_control(
-                 xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
-                 &ctx->save.dirty_bitmap_hbuf, ctx->save.p2m_size,
-                 0, &stats) != ctx->save.p2m_size )
+        ret = xc_logdirty_control(
+                  xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+                  &ctx->save.dirty_bitmap_hbuf, ctx->save.p2m_alloc_size,
+                  0, &stats);
+        if ( ret < 0 )
         {
             PERROR("Failed to retrieve logdirty bitmap");
             rc = -1;
             goto out;
         }
+        if ( ret > ctx->save.p2m_alloc_size )
+        {
+            ERROR("Guest has grown its p2m too much");
+            rc = -1;
+            goto out;
+        }
+        ctx->save.p2m_used_size = ret;
 
         policy_stats->dirty_count = stats.dirty_count;
 
@@ -614,7 +629,7 @@ static int colo_merge_secondary_dirty_bi
     for ( i = 0; i < count; i++ )
     {
         pfn = pfns[i];
-        if ( pfn >= ctx->save.p2m_size )
+        if ( pfn >= ctx->save.p2m_alloc_size )
         {
             PERROR("Invalid pfn 0x%" PRIx64, pfn);
             rc = -1;
@@ -642,6 +657,7 @@ static int suspend_and_send_dirty(struct
     xc_shadow_op_stats_t stats;
     char *progress_str = NULL;
     int rc;
+    long long ret;
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                     &ctx->save.dirty_bitmap_hbuf);
 
@@ -649,16 +665,22 @@ static int suspend_and_send_dirty(struct
     if ( rc )
         goto out;
 
-    if ( xc_logdirty_control(
-             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
-             HYPERCALL_BUFFER(dirty_bitmap), ctx->save.p2m_size,
-             XEN_DOMCTL_SHADOW_LOGDIRTY_FINAL, &stats) !=
-         ctx->save.p2m_size )
+    ret = xc_logdirty_control(
+              xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+              HYPERCALL_BUFFER(dirty_bitmap), ctx->save.p2m_alloc_size,
+              XEN_DOMCTL_SHADOW_LOGDIRTY_FINAL, &stats);
+    if ( ret < 0 )
     {
         PERROR("Failed to retrieve logdirty bitmap");
         rc = -1;
         goto out;
     }
+    if ( ret > ctx->save.p2m_alloc_size )
+    {
+        ERROR("Guest has grown its p2m too much");
+        rc = -1;
+        goto out;
+    }
 
     if ( ctx->save.live )
     {
@@ -670,7 +692,8 @@ static int suspend_and_send_dirty(struct
         xc_set_progress_prefix(xch, "Checkpointed save");
 
     if ( ctx->save.deferred_pages )
-        bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_size);
+        bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_alloc_size);
+    ctx->save.p2m_used_size = MAX(ret, ctx->save.used_deferred_pages);
 
     if ( !ctx->save.live && ctx->stream_type == XC_STREAM_COLO )
     {
@@ -688,8 +711,9 @@ static int suspend_and_send_dirty(struct
         goto out;
 
     if ( ctx->save.deferred_pages )
-        bitmap_clear(ctx->save.deferred_pages, ctx->save.p2m_size);
+        bitmap_clear(ctx->save.deferred_pages, ctx->save.p2m_alloc_size);
     ctx->save.nr_deferred_pages = 0;
+    ctx->save.used_deferred_pages = 0;
 
  out:
     xc_set_progress_prefix(xch, NULL);
@@ -702,6 +726,7 @@ static int verify_frames(struct xc_sr_co
     xc_interface *xch = ctx->xch;
     xc_shadow_op_stats_t stats;
     int rc;
+    long long ret;
     struct xc_sr_record rec = { .type = REC_TYPE_VERIFY };
 
     DPRINTF("Enabling verify mode");
@@ -715,15 +740,18 @@ static int verify_frames(struct xc_sr_co
     if ( rc )
         goto out;
 
-    if ( xc_logdirty_control(
-             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_PEEK,
-             &ctx->save.dirty_bitmap_hbuf, ctx->save.p2m_size,
-             0, &stats) != ctx->save.p2m_size )
+    ret = xc_logdirty_control(
+              xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_PEEK,
+              &ctx->save.dirty_bitmap_hbuf, ctx->save.p2m_alloc_size,
+              0, &stats);
+    if ( ret < 0 )
     {
         PERROR("Failed to retrieve logdirty bitmap");
         rc = -1;
         goto out;
     }
+    if ( ret > ctx->save.p2m_alloc_size )
+        IPRINTF("Guest has grown its p2m too much");
 
     DPRINTF("  Further stats: faults %u, dirty %u",
             stats.fault_count, stats.dirty_count);
@@ -804,13 +832,14 @@ static int setup(struct xc_sr_context *c
 
     if ( ctx->save.live || ctx->stream_type != XC_STREAM_PLAIN )
     {
+        unsigned long pages = NRPAGES(bitmap_size(ctx->save.p2m_size));
         DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                         &ctx->save.dirty_bitmap_hbuf);
 
         dirty_bitmap =
-            xc_hypercall_buffer_alloc_pages(
-                xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)));
-        ctx->save.deferred_pages = bitmap_alloc(ctx->save.p2m_size);
+            xc_hypercall_buffer_alloc_pages(xch, dirty_bitmap, pages);
+        ctx->save.p2m_alloc_size = pages << (PAGE_SHIFT + 3);
+        ctx->save.deferred_pages = bitmap_alloc(ctx->save.p2m_alloc_size);
 
         if ( !dirty_bitmap || !ctx->save.deferred_pages )
             goto enomem;
--- a/xen/arch/x86/mm/paging.c
+++ b/xen/arch/x86/mm/paging.c
@@ -397,6 +397,20 @@ int paging_mfn_is_dirty(struct domain *d
     return rv;
 }
 
+/*
+ * This is used to provide a rough (upper) estimate to the caller of how many
+ * more pages we might have data for.
+ */
+static unsigned int last_valid_entry(const mfn_t *tbl, unsigned int idx)
+{
+    unsigned int last = LOGDIRTY_NODE_ENTRIES;
+
+    for ( ; idx < LOGDIRTY_NODE_ENTRIES; ++idx )
+        if ( mfn_valid(tbl[idx]) )
+            last = idx;
+
+    return last;
+}
 
 /* Read a domain's log-dirty bitmap and stats.  If the operation is a CLEAN,
  * clear the bitmap and stats as well. */
@@ -405,10 +418,10 @@ static int paging_log_dirty_op(struct do
                                bool_t resuming)
 {
     int rv = 0, clean = 0, peek = 1;
-    unsigned long pages = 0;
+    unsigned long pages = 0, extra = 0;
     mfn_t *l4 = NULL, *l3 = NULL, *l2 = NULL;
     unsigned long *l1 = NULL;
-    int i4, i3, i2;
+    unsigned int i4, i3, i2;
 
     if ( !resuming )
     {
@@ -479,6 +492,15 @@ static int paging_log_dirty_op(struct do
         l3 = (l4 && mfn_valid(l4[i4])) ? map_domain_page(l4[i4]) : NULL;
         for ( ; (pages < sc->pages) && (i3 < LOGDIRTY_NODE_ENTRIES); i3++ )
         {
+            unsigned long max_gfn = domain_get_maximum_gpfn(d);
+
+            if ( (i4 * LOGDIRTY_NODE_ENTRIES + i3) *
+                 LOGDIRTY_NODE_ENTRIES * PAGE_SIZE * 8 > max_gfn )
+            {
+                i4 = LOGDIRTY_NODE_ENTRIES;
+                break;
+            }
+
             l2 = ((l3 && mfn_valid(l3[i3])) ?
                   map_domain_page(l3[i3]) : NULL);
             for ( i2 = 0;
@@ -502,18 +524,36 @@ static int paging_log_dirty_op(struct do
                         goto out;
                     }
                 }
+
                 pages += bytes << 3;
+
                 if ( l1 )
                 {
+                    if ( unlikely(pages >= sc->pages) )
+                        extra = (PAGE_SIZE - bytes) << 3;
+
                     if ( clean )
                         clear_page(l1);
                     unmap_domain_page(l1);
                 }
             }
+
             if ( l2 )
+            {
+                if ( unlikely(pages >= sc->pages) )
+                {
+                    i2 = last_valid_entry(l2, i2);
+                    if ( i2 < LOGDIRTY_NODE_ENTRIES )
+                        extra = ((i4 * LOGDIRTY_NODE_ENTRIES + i3) *
+                                 LOGDIRTY_NODE_ENTRIES + i2 + 1) *
+                                PAGE_SIZE * 8;
+                }
+
                 unmap_domain_page(l2);
+            }
 
-            if ( i3 < LOGDIRTY_NODE_ENTRIES - 1 && hypercall_preempt_check() )
+            if ( pages < sc->pages && i3 < LOGDIRTY_NODE_ENTRIES - 1 &&
+                 hypercall_preempt_check() )
             {
                 d->arch.paging.preempt.log_dirty.i4 = i4;
                 d->arch.paging.preempt.log_dirty.i3 = i3 + 1;
@@ -521,10 +561,21 @@ static int paging_log_dirty_op(struct do
                 break;
             }
         }
+
         if ( l3 )
+        {
+            if ( !rv && unlikely(pages >= sc->pages) )
+            {
+                i3 = last_valid_entry(l3, i3);
+                if ( i3 < LOGDIRTY_NODE_ENTRIES )
+                    extra = (i4 * LOGDIRTY_NODE_ENTRIES + i3 + 1) *
+                            LOGDIRTY_NODE_ENTRIES * PAGE_SIZE * 8;
+            }
+
             unmap_domain_page(l3);
+        }
 
-        if ( !rv && i4 < LOGDIRTY_NODE_ENTRIES - 1 &&
+        if ( !rv && pages < sc->pages && i4 < LOGDIRTY_NODE_ENTRIES - 1 &&
              hypercall_preempt_check() )
         {
             d->arch.paging.preempt.log_dirty.i4 = i4 + 1;
@@ -534,8 +585,19 @@ static int paging_log_dirty_op(struct do
         if ( rv )
             break;
     }
+
     if ( l4 )
+    {
+        if ( !rv && unlikely(pages >= sc->pages) )
+        {
+            i4 = last_valid_entry(l4, i4);
+            if ( i4 < LOGDIRTY_NODE_ENTRIES )
+                extra = (i4 + 1) * LOGDIRTY_NODE_ENTRIES *
+                        LOGDIRTY_NODE_ENTRIES * PAGE_SIZE * 8;
+        }
+
         unmap_domain_page(l4);
+    }
 
     if ( !rv )
     {
@@ -562,8 +624,8 @@ static int paging_log_dirty_op(struct do
         return rv;
     }
 
-    if ( pages < sc->pages )
-        sc->pages = pages;
+    sc->pages = min(pages + extra, domain_get_maximum_gpfn(d) + 1);
+
     if ( clean )
     {
         /* We need to further call clean_dirty_bitmap() functions of specific
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -261,7 +261,8 @@ struct xen_domctl_shadow_op {
 
     /* OP_PEEK / OP_CLEAN */
     XEN_GUEST_HANDLE_64(uint8) dirty_bitmap;
-    uint64_aligned_t pages; /* Size of buffer. Updated with actual size. */
+    uint64_aligned_t pages; /* Size of buffer. Updated with actual (or
+                               potentially needed) size. */
     struct xen_domctl_shadow_op_stats stats;
 };
 




* [PATCH 10/12] x86/mm: update log-dirty bitmap when manipulating P2M
From: Jan Beulich @ 2021-06-25 13:22 UTC
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson

Just like for PV guests MMU_MACHPHYS_UPDATE implies marking of the
respective page as dirty, additions to a HVM guest's P2M should do so.

For HVM the opposite is also true: Pages being removed from the P2M are
no longer dirty at their prior GFN; there's no point in telling the tool
stack to try and copy that page, when this will fail anyway (until
perhaps a new page gets placed there). Introduce paging_mark_pfn_clean()
(intentionally without a paging_mark_clean() counterpart) to handle
this. Note that while there is an earlier call to set_gpfn_from_mfn() in
guest_physmap_add_entry(), there's little reason to mark the page
clean there when later in the function it'll be marked dirty. This is
even more so given that at this point it's only the M2P that gets
updated, with the P2M still left unchanged.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
guest_physmap_add_entry()'s error handling looks bogus in this regard
anyway: If an error occurs before an MFN actually gets assciated with
the new GFN, the M2P entry ought to be restored imo. But of course a
guest is still hosed if the operation succeeds partially.

Note that I've not even checked mem-paging and mem-sharing code for
whether they may need similar adjustment. At least the latter is, aiui,
incompatible with log-dirty mode anyway.

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -818,7 +818,10 @@ p2m_remove_page(struct p2m_domain *p2m,
         {
             p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL);
             if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) )
+            {
                 set_gpfn_from_mfn(mfn_x(mfn) + i, INVALID_M2P_ENTRY);
+                paging_mark_pfn_clean(p2m->domain, _pfn(gfn_x(gfn) + i));
+            }
         }
     }
 
@@ -1027,8 +1030,11 @@ guest_physmap_add_entry(struct domain *d
         if ( !p2m_is_grant(t) )
         {
             for ( i = 0; i < (1UL << page_order); i++ )
+            {
                 set_gpfn_from_mfn(mfn_x(mfn_add(mfn, i)),
                                   gfn_x(gfn_add(gfn, i)));
+                paging_mark_pfn_dirty(d, _pfn(gfn_x(gfn) + i));
+            }
         }
     }
 
@@ -1314,6 +1320,7 @@ static int set_typed_p2m_entry(struct do
         {
             ASSERT(mfn_valid(mfn_add(omfn, i)));
             set_gpfn_from_mfn(mfn_x(omfn) + i, INVALID_M2P_ENTRY);
+            paging_mark_pfn_clean(d, _pfn(gfn_x(gfn) + i));
         }
 
         ioreq_request_mapcache_invalidate(d);
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -645,7 +645,10 @@ p2m_pod_decrease_reservation(struct doma
             }
             p2m_tlb_flush_sync(p2m);
             for ( j = 0; j < n; ++j )
+            {
                 set_gpfn_from_mfn(mfn_x(mfn), INVALID_M2P_ENTRY);
+                paging_mark_pfn_clean(d, _pfn(gfn_x(gfn) + i + j));
+            }
             p2m_pod_cache_add(p2m, page, cur_order);
 
             ioreq_request_mapcache_invalidate(d);
--- a/xen/arch/x86/mm/paging.c
+++ b/xen/arch/x86/mm/paging.c
@@ -259,7 +259,7 @@ static int paging_log_dirty_disable(stru
 }
 
 /* Mark a page as dirty, with taking guest pfn as parameter */
-void paging_mark_pfn_dirty(struct domain *d, pfn_t pfn)
+static void mark_pfn_dirty(struct domain *d, pfn_t pfn, bool dirty)
 {
     bool changed;
     mfn_t mfn, *l4, *l3, *l2;
@@ -290,14 +290,15 @@ void paging_mark_pfn_dirty(struct domain
 
     if ( unlikely(!mfn_valid(d->arch.paging.log_dirty.top)) ) 
     {
-         d->arch.paging.log_dirty.top = paging_new_log_dirty_node(d);
+         if ( dirty )
+             d->arch.paging.log_dirty.top = paging_new_log_dirty_node(d);
          if ( unlikely(!mfn_valid(d->arch.paging.log_dirty.top)) )
              goto out;
     }
 
     l4 = paging_map_log_dirty_bitmap(d);
     mfn = l4[i4];
-    if ( !mfn_valid(mfn) )
+    if ( !mfn_valid(mfn) && dirty )
         l4[i4] = mfn = paging_new_log_dirty_node(d);
     unmap_domain_page(l4);
     if ( !mfn_valid(mfn) )
@@ -305,7 +306,7 @@ void paging_mark_pfn_dirty(struct domain
 
     l3 = map_domain_page(mfn);
     mfn = l3[i3];
-    if ( !mfn_valid(mfn) )
+    if ( !mfn_valid(mfn) && dirty )
         l3[i3] = mfn = paging_new_log_dirty_node(d);
     unmap_domain_page(l3);
     if ( !mfn_valid(mfn) )
@@ -313,21 +314,22 @@ void paging_mark_pfn_dirty(struct domain
 
     l2 = map_domain_page(mfn);
     mfn = l2[i2];
-    if ( !mfn_valid(mfn) )
+    if ( !mfn_valid(mfn) && dirty )
         l2[i2] = mfn = paging_new_log_dirty_leaf(d);
     unmap_domain_page(l2);
     if ( !mfn_valid(mfn) )
         goto out;
 
     l1 = map_domain_page(mfn);
-    changed = !__test_and_set_bit(i1, l1);
+    changed = dirty ? !__test_and_set_bit(i1, l1)
+                    : __test_and_clear_bit(i1, l1);
     unmap_domain_page(l1);
     if ( changed )
     {
         PAGING_DEBUG(LOGDIRTY,
-                     "d%d: marked mfn %" PRI_mfn " (pfn %" PRI_pfn ")\n",
-                     d->domain_id, mfn_x(mfn), pfn_x(pfn));
-        d->arch.paging.log_dirty.dirty_count++;
+                     "%pd: marked mfn %" PRI_mfn " (pfn %" PRI_pfn ") %s\n",
+                     d, mfn_x(mfn), pfn_x(pfn), dirty ? "dirty" : "clean");
+        d->arch.paging.log_dirty.dirty_count += dirty ? 1 : -1;
     }
 
 out:
@@ -336,6 +338,16 @@ out:
     return;
 }
 
+void paging_mark_pfn_dirty(struct domain *d, pfn_t pfn)
+{
+    mark_pfn_dirty(d, pfn, true);
+}
+
+void paging_mark_pfn_clean(struct domain *d, pfn_t pfn)
+{
+    mark_pfn_dirty(d, pfn, false);
+}
+
 /* Mark a page as dirty */
 void paging_mark_dirty(struct domain *d, mfn_t gmfn)
 {
@@ -348,7 +360,7 @@ void paging_mark_dirty(struct domain *d,
     /* We /really/ mean PFN here, even for non-translated guests. */
     pfn = _pfn(get_gpfn_from_mfn(mfn_x(gmfn)));
 
-    paging_mark_pfn_dirty(d, pfn);
+    mark_pfn_dirty(d, pfn, true);
 }
 
 
--- a/xen/include/asm-x86/paging.h
+++ b/xen/include/asm-x86/paging.h
@@ -170,8 +170,9 @@ void paging_log_dirty_init(struct domain
 
 /* mark a page as dirty */
 void paging_mark_dirty(struct domain *d, mfn_t gmfn);
-/* mark a page as dirty with taking guest pfn as parameter */
+/* mark a page as dirty/clean, taking a guest pfn as parameter */
 void paging_mark_pfn_dirty(struct domain *d, pfn_t pfn);
+void paging_mark_pfn_clean(struct domain *d, pfn_t pfn);
 
 /* is this guest page dirty? 
  * This is called from inside paging code, with the paging lock held. */



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 11/12] x86/mm: pull a sanity check earlier in xenmem_add_to_physmap_one()
  2021-06-25 13:15 [PATCH 00/12] x86: more or less log-dirty related improvements Jan Beulich
                   ` (9 preceding siblings ...)
  2021-06-25 13:22 ` [PATCH 10/12] x86/mm: update log-dirty bitmap when manipulating P2M Jan Beulich
@ 2021-06-25 13:22 ` Jan Beulich
  2021-06-25 19:10   ` Andrew Cooper
  2021-06-25 13:24 ` [PATCH 12/12] SUPPORT.md: write down restriction of 32-bit tool stacks Jan Beulich
  11 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2021-06-25 13:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap

We should try to limit the failure reasons after we've started making
changes.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -2763,6 +2763,15 @@ int xenmem_add_to_physmap_one(
         goto put_both;
     }
 
+    /* XENMAPSPACE_gmfn: Check if the MFN is associated with another GFN. */
+    old_gpfn = get_gpfn_from_mfn(mfn_x(mfn));
+    ASSERT(!SHARED_M2P(old_gpfn));
+    if ( space == XENMAPSPACE_gmfn && old_gpfn != gfn )
+    {
+        rc = -EXDEV;
+        goto put_both;
+    }
+
     /* Remove previously mapped page if it was present. */
     prev_mfn = get_gfn(d, gfn_x(gpfn), &p2mt);
     if ( mfn_valid(prev_mfn) )
@@ -2781,13 +2790,6 @@ int xenmem_add_to_physmap_one(
         goto put_both;
 
     /* Unmap from old location, if any. */
-    old_gpfn = get_gpfn_from_mfn(mfn_x(mfn));
-    ASSERT(!SHARED_M2P(old_gpfn));
-    if ( space == XENMAPSPACE_gmfn && old_gpfn != gfn )
-    {
-        rc = -EXDEV;
-        goto put_both;
-    }
     if ( old_gpfn != INVALID_M2P_ENTRY )
         rc = guest_physmap_remove_page(d, _gfn(old_gpfn), mfn, PAGE_ORDER_4K);
 



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 12/12] SUPPORT.md: write down restriction of 32-bit tool stacks
  2021-06-25 13:15 [PATCH 00/12] x86: more or less log-dirty related improvements Jan Beulich
                   ` (10 preceding siblings ...)
  2021-06-25 13:22 ` [PATCH 11/12] x86/mm: pull a sanity check earlier in xenmem_add_to_physmap_one() Jan Beulich
@ 2021-06-25 13:24 ` Jan Beulich
  2021-06-25 19:45   ` Andrew Cooper
  11 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2021-06-25 13:24 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu

Let's try to avoid giving the impression that 32-bit tool stacks are as
capable as 64-bit ones.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -131,6 +131,11 @@ ARM only has one guest type at the momen
 
 ## Toolstack
 
+While 32-bit builds of the tool stack are generally supported, restrictions
+apply in particular when running on top of a 64-bit hypervisor.  For example,
+very large guests aren't supported in this case.  This includes guests giving
+the appearance of being large, by altering their own memory layouts.
+
 ### xl
 
     Status: Supported



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 01/12] libxc: split xc_logdirty_control() from xc_shadow_control()
  2021-06-25 13:17 ` [PATCH 01/12] libxc: split xc_logdirty_control() from xc_shadow_control() Jan Beulich
@ 2021-06-25 14:51   ` Christian Lindig
  2021-06-25 15:49   ` Andrew Cooper
  1 sibling, 0 replies; 36+ messages in thread
From: Christian Lindig @ 2021-06-25 14:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson, Marek Marczykowski,
	David Scott

For the OCaml part:

Acked-by: Christian Lindig <christian.lindig@citrix.com>

On 25 Jun 2021, at 14:17, Jan Beulich <jbeulich@suse.com> wrote:

For log-dirty operations a 64-bit field is being truncated to become an
"int" return value. Seeing the large number of arguments the present
function takes, reduce its set of parameters to that needed for all
operations not involving the log-dirty bitmap, while introducing a new
wrapper for the log-dirty bitmap operations. This new function in turn
doesn't need an "mb" parameter, but has a 64-bit return type. (Using the
return value in favor of a pointer-type parameter is left as is, to
disturb callers as little as possible.)

While altering xc_shadow_control() anyway, also adjust the types of the
last two of the remaining parameters.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I wonder whether we shouldn't take the opportunity and also rename
xc_shadow_control() to, say, xc_paging_control(), matching the layer
above the HAP/shadow distinction in the hypervisor.

--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -885,11 +885,15 @@ typedef struct xen_domctl_shadow_op_stat
int xc_shadow_control(xc_interface *xch,
                      uint32_t domid,
                      unsigned int sop,
-                      xc_hypercall_buffer_t *dirty_bitmap,
-                      unsigned long pages,
-                      unsigned long *mb,
-                      uint32_t mode,
-                      xc_shadow_op_stats_t *stats);
+                      unsigned int *mb,
+                      unsigned int mode);
+long long xc_logdirty_control(xc_interface *xch,
+                              uint32_t domid,
+                              unsigned int sop,
+                              xc_hypercall_buffer_t *dirty_bitmap,
+                              unsigned long pages,
+                              unsigned int mode,
+                              xc_shadow_op_stats_t *stats);

int xc_sched_credit_domain_set(xc_interface *xch,
                               uint32_t domid,
--- a/tools/libs/ctrl/xc_domain.c
+++ b/tools/libs/ctrl/xc_domain.c
@@ -650,25 +650,48 @@ int xc_watchdog(xc_interface *xch,
int xc_shadow_control(xc_interface *xch,
                      uint32_t domid,
                      unsigned int sop,
-                      xc_hypercall_buffer_t *dirty_bitmap,
-                      unsigned long pages,
-                      unsigned long *mb,
-                      uint32_t mode,
-                      xc_shadow_op_stats_t *stats)
+                      unsigned int *mb,
+                      unsigned int mode)
{
    int rc;
    DECLARE_DOMCTL;
-    DECLARE_HYPERCALL_BUFFER_ARGUMENT(dirty_bitmap);

    memset(&domctl, 0, sizeof(domctl));

    domctl.cmd = XEN_DOMCTL_shadow_op;
    domctl.domain = domid;
    domctl.u.shadow_op.op     = sop;
-    domctl.u.shadow_op.pages  = pages;
    domctl.u.shadow_op.mb     = mb ? *mb : 0;
    domctl.u.shadow_op.mode   = mode;
-    if (dirty_bitmap != NULL)
+
+    rc = do_domctl(xch, &domctl);
+
+    if ( mb )
+        *mb = domctl.u.shadow_op.mb;
+
+    return rc;
+}
+
+long long xc_logdirty_control(xc_interface *xch,
+                              uint32_t domid,
+                              unsigned int sop,
+                              xc_hypercall_buffer_t *dirty_bitmap,
+                              unsigned long pages,
+                              unsigned int mode,
+                              xc_shadow_op_stats_t *stats)
+{
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER_ARGUMENT(dirty_bitmap);
+
+    memset(&domctl, 0, sizeof(domctl));
+
+    domctl.cmd = XEN_DOMCTL_shadow_op;
+    domctl.domain = domid;
+    domctl.u.shadow_op.op    = sop;
+    domctl.u.shadow_op.pages = pages;
+    domctl.u.shadow_op.mode  = mode;
+    if ( dirty_bitmap )
        set_xen_guest_handle(domctl.u.shadow_op.dirty_bitmap,
                                dirty_bitmap);

@@ -678,9 +701,6 @@ int xc_shadow_control(xc_interface *xch,
        memcpy(stats, &domctl.u.shadow_op.stats,
               sizeof(xc_shadow_op_stats_t));

-    if ( mb )
-        *mb = domctl.u.shadow_op.mb;
-
    return (rc == 0) ? domctl.u.shadow_op.pages : rc;
}

--- a/tools/libs/guest/xg_sr_restore.c
+++ b/tools/libs/guest/xg_sr_restore.c
@@ -459,10 +459,10 @@ static int send_checkpoint_dirty_pfn_lis
    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                    &ctx->restore.dirty_bitmap_hbuf);

-    if ( xc_shadow_control(
+    if ( xc_logdirty_control(
             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
             HYPERCALL_BUFFER(dirty_bitmap), ctx->restore.p2m_size,
-             NULL, 0, &stats) != ctx->restore.p2m_size )
+             0, &stats) != ctx->restore.p2m_size )
    {
        PERROR("Failed to retrieve logdirty bitmap");
        goto err;
--- a/tools/libs/guest/xg_sr_save.c
+++ b/tools/libs/guest/xg_sr_save.c
@@ -428,18 +428,18 @@ static int enable_logdirty(struct xc_sr_
    /* This juggling is required if logdirty is enabled for VRAM tracking. */
    rc = xc_shadow_control(xch, ctx->domid,
                           XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
-                           NULL, 0, NULL, 0, NULL);
+                           NULL, 0);
    if ( rc < 0 )
    {
        on1 = errno;
        rc = xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
-                               NULL, 0, NULL, 0, NULL);
+                               NULL, 0);
        if ( rc < 0 )
            off = errno;
        else {
            rc = xc_shadow_control(xch, ctx->domid,
                                   XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
-                                   NULL, 0, NULL, 0, NULL);
+                                   NULL, 0);
            if ( rc < 0 )
                on2 = errno;
        }
@@ -556,10 +556,10 @@ static int send_memory_live(struct xc_sr
        if ( policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
            break;

-        if ( xc_shadow_control(
+        if ( xc_logdirty_control(
                 xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
                 &ctx->save.dirty_bitmap_hbuf, ctx->save.p2m_size,
-                 NULL, 0, &stats) != ctx->save.p2m_size )
+                 0, &stats) != ctx->save.p2m_size )
        {
            PERROR("Failed to retrieve logdirty bitmap");
            rc = -1;
@@ -653,10 +653,10 @@ static int suspend_and_send_dirty(struct
    if ( rc )
        goto out;

-    if ( xc_shadow_control(
+    if ( xc_logdirty_control(
             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
             HYPERCALL_BUFFER(dirty_bitmap), ctx->save.p2m_size,
-             NULL, XEN_DOMCTL_SHADOW_LOGDIRTY_FINAL, &stats) !=
+             XEN_DOMCTL_SHADOW_LOGDIRTY_FINAL, &stats) !=
         ctx->save.p2m_size )
    {
        PERROR("Failed to retrieve logdirty bitmap");
@@ -716,10 +716,10 @@ static int verify_frames(struct xc_sr_co
    if ( rc )
        goto out;

-    if ( xc_shadow_control(
+    if ( xc_logdirty_control(
             xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_PEEK,
             &ctx->save.dirty_bitmap_hbuf, ctx->save.p2m_size,
-             NULL, 0, &stats) != ctx->save.p2m_size )
+             0, &stats) != ctx->save.p2m_size )
    {
        PERROR("Failed to retrieve logdirty bitmap");
        rc = -1;
@@ -834,7 +834,7 @@ static void cleanup(struct xc_sr_context


    xc_shadow_control(xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_OFF,
-                      NULL, 0, NULL, 0, NULL);
+                      NULL, 0);

    if ( ctx->save.ops.cleanup(ctx) )
        PERROR("Failed to clean up");
--- a/tools/libs/light/libxl_colo_restore.c
+++ b/tools/libs/light/libxl_colo_restore.c
@@ -62,7 +62,7 @@ static void colo_enable_logdirty(libxl__
    /* we need to know which pages are dirty to restore the guest */
    if (xc_shadow_control(CTX->xch, domid,
                          XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
-                          NULL, 0, NULL, 0, NULL) < 0) {
+                          NULL, 0) < 0) {
        LOGD(ERROR, domid, "cannot enable secondary vm's logdirty");
        lds->callback(egc, lds, ERROR_FAIL);
        return;
@@ -90,7 +90,7 @@ static void colo_disable_logdirty(libxl_

    /* we need to know which pages are dirty to restore the guest */
    if (xc_shadow_control(CTX->xch, domid, XEN_DOMCTL_SHADOW_OP_OFF,
-                          NULL, 0, NULL, 0, NULL) < 0)
+                          NULL, 0) < 0)
        LOGD(WARN, domid, "cannot disable secondary vm's logdirty");

    if (crs->hvm) {
--- a/tools/libs/light/libxl_x86.c
+++ b/tools/libs/light/libxl_x86.c
@@ -529,10 +529,10 @@ int libxl__arch_domain_create(libxl__gc
        xc_domain_set_time_offset(ctx->xch, domid, rtc_timeoffset);

    if (d_config->b_info.type != LIBXL_DOMAIN_TYPE_PV) {
-        unsigned long shadow = DIV_ROUNDUP(d_config->b_info.shadow_memkb,
-                                           1024);
+        unsigned int shadow = DIV_ROUNDUP(d_config->b_info.shadow_memkb,
+                                          1024);
        xc_shadow_control(ctx->xch, domid, XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION,
-                          NULL, 0, &shadow, 0, NULL);
+                          &shadow, 0);
    }

    if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV &&
--- a/tools/ocaml/libs/xc/xenctrl_stubs.c
+++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
@@ -997,13 +997,13 @@ CAMLprim value stub_shadow_allocation_ge
 {
 	CAMLparam2(xch, domid);
 	CAMLlocal1(mb);
-	unsigned long c_mb;
+	unsigned int c_mb;
 	int ret;

 	caml_enter_blocking_section();
 	ret = xc_shadow_control(_H(xch), _D(domid),
 				XEN_DOMCTL_SHADOW_OP_GET_ALLOCATION,
-				NULL, 0, &c_mb, 0, NULL);
+				&c_mb, 0);
 	caml_leave_blocking_section();
 	if (ret != 0)
 		failwith_xc(_H(xch));
@@ -1016,14 +1016,14 @@ CAMLprim value stub_shadow_allocation_se
 					  value mb)
 {
 	CAMLparam3(xch, domid, mb);
-	unsigned long c_mb;
+	unsigned int c_mb;
 	int ret;

 	c_mb = Int_val(mb);
 	caml_enter_blocking_section();
 	ret = xc_shadow_control(_H(xch), _D(domid),
 				XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION,
-				NULL, 0, &c_mb, 0, NULL);
+				&c_mb, 0);
 	caml_leave_blocking_section();
 	if (ret != 0)
 		failwith_xc(_H(xch));
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -1192,8 +1192,7 @@ static PyObject *pyxc_shadow_control(PyO
                                      &dom, &op) )
        return NULL;

-    if ( xc_shadow_control(xc->xc_handle, dom, op, NULL, 0, NULL, 0, NULL)
-         < 0 )
+    if ( xc_shadow_control(xc->xc_handle, dom, op, NULL, 0) < 0 )
        return pyxc_error_to_exception(xc->xc_handle);

    Py_INCREF(zero);
@@ -1208,7 +1207,7 @@ static PyObject *pyxc_shadow_mem_control
    int op;
    uint32_t dom;
    int mbarg = -1;
-    unsigned long mb;
+    unsigned int mb;

    static char *kwd_list[] = { "dom", "mb", NULL };

@@ -1223,7 +1222,7 @@ static PyObject *pyxc_shadow_mem_control
        mb = mbarg;
        op = XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION;
    }
-    if ( xc_shadow_control(xc->xc_handle, dom, op, NULL, 0, &mb, 0, NULL) < 0 )
+    if ( xc_shadow_control(xc->xc_handle, dom, op, &mb, 0) < 0 )
        return pyxc_error_to_exception(xc->xc_handle);

    mbarg = mb;



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 01/12] libxc: split xc_logdirty_control() from xc_shadow_control()
  2021-06-25 13:17 ` [PATCH 01/12] libxc: split xc_logdirty_control() from xc_shadow_control() Jan Beulich
  2021-06-25 14:51   ` Christian Lindig
@ 2021-06-25 15:49   ` Andrew Cooper
  2021-06-28  9:40     ` Jan Beulich
  1 sibling, 1 reply; 36+ messages in thread
From: Andrew Cooper @ 2021-06-25 15:49 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson, Marek Marczykowski,
	Christian Lindig, David Scott

On 25/06/2021 14:17, Jan Beulich wrote:
> For log-dirty operations a 64-bit field is being truncated to become an
> "int" return value. Seeing the large number of arguments the present
> function takes, reduce its set of parameters to that needed for all
> operations not involving the log-dirty bitmap, while introducing a new
> wrapper for the log-dirty bitmap operations. This new function in turn
> doesn't need an "mb" parameter, but has a 64-bit return type. (Using the
> return value in favor of a pointer-type parameter is left as is, to
> disturb callers as little as possible.)
>
> While altering xc_shadow_control() anyway, also adjust the types of the
> last two of the remaining parameters.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I wonder whether we shouldn't take the opportunity and also rename
> xc_shadow_control() to, say, xc_paging_control(), matching the layer
> above the HAP/shadow distinction in the hypervisor.

I do remember this being an especially obnoxious interface to use.  Any
improvement would go a long way, but I think we need to rename some
domctls too.

>
> --- a/tools/include/xenctrl.h
> +++ b/tools/include/xenctrl.h
> @@ -885,11 +885,15 @@ typedef struct xen_domctl_shadow_op_stat
>  int xc_shadow_control(xc_interface *xch,
>                        uint32_t domid,
>                        unsigned int sop,
> -                      xc_hypercall_buffer_t *dirty_bitmap,
> -                      unsigned long pages,
> -                      unsigned long *mb,
> -                      uint32_t mode,
> -                      xc_shadow_op_stats_t *stats);
> +                      unsigned int *mb,
> +                      unsigned int mode);
> +long long xc_logdirty_control(xc_interface *xch,

uint64_t to match the hypercall?  All users of libxc are stdint.h aware.

> --- a/tools/libs/ctrl/xc_domain.c
> +++ b/tools/libs/ctrl/xc_domain.c
> @@ -650,25 +650,48 @@ int xc_watchdog(xc_interface *xch,
>  int xc_shadow_control(xc_interface *xch,
>                        uint32_t domid,
>                        unsigned int sop,
> -                      xc_hypercall_buffer_t *dirty_bitmap,
> -                      unsigned long pages,
> -                      unsigned long *mb,
> -                      uint32_t mode,
> -                      xc_shadow_op_stats_t *stats)
> +                      unsigned int *mb,
> +                      unsigned int mode)
>  {
>      int rc;
>      DECLARE_DOMCTL;
> -    DECLARE_HYPERCALL_BUFFER_ARGUMENT(dirty_bitmap);
>  
>      memset(&domctl, 0, sizeof(domctl));
>  
>      domctl.cmd = XEN_DOMCTL_shadow_op;
>      domctl.domain = domid;
>      domctl.u.shadow_op.op     = sop;
> -    domctl.u.shadow_op.pages  = pages;
>      domctl.u.shadow_op.mb     = mb ? *mb : 0;
>      domctl.u.shadow_op.mode   = mode;
> -    if (dirty_bitmap != NULL)
> +
> +    rc = do_domctl(xch, &domctl);
> +
> +    if ( mb )
> +        *mb = domctl.u.shadow_op.mb;
> +
> +    return rc;
> +}
> +
> +long long xc_logdirty_control(xc_interface *xch,
> +                              uint32_t domid,
> +                              unsigned int sop,
> +                              xc_hypercall_buffer_t *dirty_bitmap,
> +                              unsigned long pages,
> +                              unsigned int mode,
> +                              xc_shadow_op_stats_t *stats)
> +{
> +    int rc;
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER_ARGUMENT(dirty_bitmap);
> +
> +    memset(&domctl, 0, sizeof(domctl));
> +
> +    domctl.cmd = XEN_DOMCTL_shadow_op;
> +    domctl.domain = domid;
> +    domctl.u.shadow_op.op    = sop;
> +    domctl.u.shadow_op.pages = pages;
> +    domctl.u.shadow_op.mode  = mode;

Please use:

struct xen_domctl domctl = {
    .cmd = XEN_DOMCTL_shadow_op,
    ...
};

I've been slowly taking out users of DECLARE_DOMCTL, because beyond
being pure code obfuscation, valgrind (rightly) complains that the
hypercall operates on uninitialised memory.
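
For the case at hand that would presumably end up as something like:

    struct xen_domctl domctl = {
        .cmd = XEN_DOMCTL_shadow_op,
        .domain = domid,
        .u.shadow_op = {
            .op    = sop,
            .pages = pages,
            .mode  = mode,
        },
    };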

> --- a/tools/libs/light/libxl_colo_restore.c
> +++ b/tools/libs/light/libxl_colo_restore.c
> @@ -62,7 +62,7 @@ static void colo_enable_logdirty(libxl__
>      /* we need to know which pages are dirty to restore the guest */
>      if (xc_shadow_control(CTX->xch, domid,
>                            XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
> -                          NULL, 0, NULL, 0, NULL) < 0) {
> +                          NULL, 0) < 0) {
>          LOGD(ERROR, domid, "cannot enable secondary vm's logdirty");
>          lds->callback(egc, lds, ERROR_FAIL);
>          return;

:-/ even more COLO code which escaped my attempts to use a consistent
coding style.

I'll fix this up later, as it's fairly invasive (context wise).

> @@ -90,7 +90,7 @@ static void colo_disable_logdirty(libxl_
>  
>      /* we need to know which pages are dirty to restore the guest */
>      if (xc_shadow_control(CTX->xch, domid, XEN_DOMCTL_SHADOW_OP_OFF,
> -                          NULL, 0, NULL, 0, NULL) < 0)
> +                          NULL, 0) < 0)
>          LOGD(WARN, domid, "cannot disable secondary vm's logdirty");
>  
>      if (crs->hvm) {
> --- a/tools/libs/light/libxl_x86.c
> +++ b/tools/libs/light/libxl_x86.c
> @@ -529,10 +529,10 @@ int libxl__arch_domain_create(libxl__gc
>          xc_domain_set_time_offset(ctx->xch, domid, rtc_timeoffset);
>  
>      if (d_config->b_info.type != LIBXL_DOMAIN_TYPE_PV) {
> -        unsigned long shadow = DIV_ROUNDUP(d_config->b_info.shadow_memkb,
> -                                           1024);
> +        unsigned int shadow = DIV_ROUNDUP(d_config->b_info.shadow_memkb,
> +                                          1024);
>          xc_shadow_control(ctx->xch, domid, XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION,
> -                          NULL, 0, &shadow, 0, NULL);
> +                          &shadow, 0);

I know this isn't introduced by your patch, but this cannot possibly be
correct without error handling.  There is a good chance of this call
running Xen out of memory.

Any chance of a fix split out into a separate patch?
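
Something along these lines, say (untested sketch - whether ERROR_FAIL
is the right way to bail, and the exact message, are for whoever writes
the actual patch):

    if (d_config->b_info.type != LIBXL_DOMAIN_TYPE_PV) {
        unsigned int shadow = DIV_ROUNDUP(d_config->b_info.shadow_memkb,
                                          1024);

        if (xc_shadow_control(ctx->xch, domid,
                              XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION,
                              &shadow, 0) < 0) {
            LOGED(ERROR, domid, "Couldn't set shadow memory allocation");
            return ERROR_FAIL;
        }
    }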

>      }
>  
>      if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV &&
> --- a/tools/ocaml/libs/xc/xenctrl_stubs.c
> +++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
> @@ -997,13 +997,13 @@ CAMLprim value stub_shadow_allocation_ge
>  {
>  	CAMLparam2(xch, domid);
>  	CAMLlocal1(mb);
> -	unsigned long c_mb;
> +	unsigned int c_mb;
>  	int ret;
>  
>  	caml_enter_blocking_section();
>  	ret = xc_shadow_control(_H(xch), _D(domid),
>  				XEN_DOMCTL_SHADOW_OP_GET_ALLOCATION,
> -				NULL, 0, &c_mb, 0, NULL);
> +				&c_mb, 0);
>  	caml_leave_blocking_section();
>  	if (ret != 0)
>  		failwith_xc(_H(xch));

Not a bug introduced in this patch, but this is broken.  There is a kb
vs mb units mismatch, and I don't see any shifts by 10 anywhere in the
Ocaml stubs.

> @@ -1016,14 +1016,14 @@ CAMLprim value stub_shadow_allocation_se
>  					  value mb)
>  {
>  	CAMLparam3(xch, domid, mb);
> -	unsigned long c_mb;
> +	unsigned int c_mb;
>  	int ret;
>  
>  	c_mb = Int_val(mb);

This has a 31 bit truncation issue on 32bit builds.  I'm not sure how
much we care.
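
(Ocaml's native int only has 31 bits on 32bit builds, so an mb value of
2^30 or more can't even be represented on the Ocaml side in the first
place.)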

>  	caml_enter_blocking_section();
>  	ret = xc_shadow_control(_H(xch), _D(domid),
>  				XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION,
> -				NULL, 0, &c_mb, 0, NULL);
> +				&c_mb, 0);
>  	caml_leave_blocking_section();
>  	if (ret != 0)
>  		failwith_xc(_H(xch));
> --- a/tools/python/xen/lowlevel/xc/xc.c
> +++ b/tools/python/xen/lowlevel/xc/xc.c
> @@ -1192,8 +1192,7 @@ static PyObject *pyxc_shadow_control(PyO
>                                        &dom, &op) )
>          return NULL;
>      
> -    if ( xc_shadow_control(xc->xc_handle, dom, op, NULL, 0, NULL, 0, NULL) 
> -         < 0 )
> +    if ( xc_shadow_control(xc->xc_handle, dom, op, NULL, 0) < 0 )
>          return pyxc_error_to_exception(xc->xc_handle);
>      
>      Py_INCREF(zero);
> @@ -1208,7 +1207,7 @@ static PyObject *pyxc_shadow_mem_control
>      int op;
>      uint32_t dom;
>      int mbarg = -1;
> -    unsigned long mb;
> +    unsigned int mb;
>  
>      static char *kwd_list[] = { "dom", "mb", NULL };
>  
> @@ -1223,7 +1222,7 @@ static PyObject *pyxc_shadow_mem_control
>          mb = mbarg;
>          op = XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION;
>      }
> -    if ( xc_shadow_control(xc->xc_handle, dom, op, NULL, 0, &mb, 0, NULL) < 0 )
> +    if ( xc_shadow_control(xc->xc_handle, dom, op, &mb, 0) < 0 )
>          return pyxc_error_to_exception(xc->xc_handle);

Here too.  There are int truncations on the input and output, and like
the Ocaml stubs, an apparent kb vs mb confusion.

I'm not sure whether switching to PyLong is sensible.  It's probably ok
from a compatibility perspective.

~Andrew



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/12] libxenguest: deal with log-dirty op stats overflow
  2021-06-25 13:18 ` [PATCH 02/12] libxenguest: deal with log-dirty op stats overflow Jan Beulich
@ 2021-06-25 16:36   ` Andrew Cooper
  2021-06-28  7:48     ` Jan Beulich
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Cooper @ 2021-06-25 16:36 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, Juergen Gross, George Dunlap, Ian Jackson

On 25/06/2021 14:18, Jan Beulich wrote:
> In send_memory_live() the precise value the dirty_count struct field
> gets initialized to doesn't matter much (apart from the triggering of
> the log message in send_dirty_pages(), see below), but it is important
> that it not be zero on the first iteration (or else send_dirty_pages()
> won't get called at all). Saturate the initializer value at the maximum
> value the field can hold.

I don't follow.  Migration would be extremely broken if the first
iteration didn't work correctly, so something else is going on here.

>
> While there also initialize struct precopy_stats' respective field to a
> more sane value: We don't really know how many dirty pages there are at
> that point.
>
> In suspend_and_send_dirty() and verify_frames() the local variables
> don't need initializing at all, as they're only an output from the
> hypercall which gets invoked first thing.
>
> In send_checkpoint_dirty_pfn_list() the local variable can be dropped
> altogether: It's optional to xc_logdirty_control() and not used anywhere
> else.
>
> Note that in case the clipping actually takes effect, the "Bitmap
> contained more entries than expected..." log message will trigger. This
> being just an informational message, I don't think this is overly
> concerning.

That message is currently an error, confirming that the VM will crash on
the resuming side.

This is a consequence of it attempting to balloon during the live phase
of migration, and is discussed in docs/features/migration.pandoc (well -
at least mentioned on the "no-one has fixed this yet" list).

> --- a/tools/libs/guest/xg_sr_save.c
> +++ b/tools/libs/guest/xg_sr_save.c
> @@ -500,7 +500,9 @@ static int simple_precopy_policy(struct
>  static int send_memory_live(struct xc_sr_context *ctx)
>  {
>      xc_interface *xch = ctx->xch;
> -    xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
> +    xc_shadow_op_stats_t stats = {
> +        .dirty_count = MIN(ctx->save.p2m_size, (typeof(stats.dirty_count))~0)
> +    };
>      char *progress_str = NULL;
>      unsigned int x = 0;
>      int rc;
> @@ -519,7 +521,7 @@ static int send_memory_live(struct xc_sr
>          goto out;
>  
>      ctx->save.stats = (struct precopy_stats){
> -        .dirty_count = ctx->save.p2m_size,
> +        .dirty_count = -1,

This is an external interface, and I'm not sure it will tolerate finding
more than p2m_size allegedly dirty.

~Andrew



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/12] libxenguest: short-circuit "all-dirty" handling
  2021-06-25 13:18 ` [PATCH 03/12] libxenguest: short-circuit "all-dirty" handling Jan Beulich
@ 2021-06-25 17:02   ` Andrew Cooper
  2021-06-28  8:26     ` Jan Beulich
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Cooper @ 2021-06-25 17:02 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, Juergen Gross, George Dunlap, Ian Jackson

On 25/06/2021 14:18, Jan Beulich wrote:
> For one it is unnecessary to fill a perhaps large chunk of memory with
> all ones. Add a new parameter to send_dirty_pages() for callers to
> indicate so.
>
> Then it is further unnecessary to allocate the dirty bitmap altogether
> when all that's ever going to happen is a single all-dirty run.

The allocation is deliberate, and does want to stay where it is IMO.

Single all-dirty runs are a debugging technique only.  All production
cases are live, and you don't want to fail midway through because a
late, large, memory allocation failed.


As for the send_{dirty,all}_pages() split, that was deliberate to keep
the logic simple.  The logdirty bitmap is tiny (in comparison to other
structures) outside of artificial cases like this.

What you've done with this change has rendered send_all_pages()
redundant, but not actually taken it out of the code, thereby
complicating it.  At the moment, this doesn't look to be an improvement.

> @@ -807,8 +798,11 @@ static int setup(struct xc_sr_context *c
>      if ( rc )
>          goto err;
>  
> -    dirty_bitmap = xc_hypercall_buffer_alloc_pages(
> -        xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)));
> +    dirty_bitmap = ctx->save.live || ctx->stream_type != XC_STREAM_PLAIN
> +        ? xc_hypercall_buffer_alloc_pages(
> +              xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)))
> +        : (void *)-1L;

This is a pointer loaded with a timebomb, which doesn't trigger NULL
pointer checks, and for which {set,clear}_bit(dirty_bitmap, large_pfn)
won't fault and will instead corrupt random areas of the address space.

~Andrew



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/12] libxenguest: avoid allocating unused deferred-pages bitmap
  2021-06-25 13:19 ` [PATCH 04/12] libxenguest: avoid allocating unused deferred-pages bitmap Jan Beulich
@ 2021-06-25 18:08   ` Andrew Cooper
  2021-06-28  8:47     ` Jan Beulich
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Cooper @ 2021-06-25 18:08 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, Juergen Gross, George Dunlap, Ian Jackson

On 25/06/2021 14:19, Jan Beulich wrote:
> Like for the dirty bitmap, it is unnecessary to allocate the deferred-
> pages bitmap when all that's ever going to happen is a single all-dirty
> run.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> The clearing of the bitmap at the end of suspend_and_send_dirty() also
> looks unnecessary - am I overlooking anything?

Yes. Remus and COLO.  You don't want to accumulate successfully-sent
deferred pages over checkpoints, otherwise you'll eventually be sending
the entire VM every checkpoint.
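
The clear at the end of suspend_and_send_dirty() is what resets that
accumulation between checkpoints - from memory, it amounts to:

    bitmap_clear(ctx->save.deferred_pages, ctx->save.p2m_size);
    ctx->save.nr_deferred_pages = 0;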


Answering out of patch order...
> @@ -791,24 +797,31 @@ static int setup(struct xc_sr_context *c
>  {
>      xc_interface *xch = ctx->xch;
>      int rc;
> -    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> -                                    &ctx->save.dirty_bitmap_hbuf);
>  
>      rc = ctx->save.ops.setup(ctx);
>      if ( rc )
>          goto err;
>  
> -    dirty_bitmap = ctx->save.live || ctx->stream_type != XC_STREAM_PLAIN
> -        ? xc_hypercall_buffer_alloc_pages(
> -              xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)))
> -        : (void *)-1L;
> +    if ( ctx->save.live || ctx->stream_type != XC_STREAM_PLAIN )
> +    {
> +        DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> +                                        &ctx->save.dirty_bitmap_hbuf);
> +
> +        dirty_bitmap =
> +            xc_hypercall_buffer_alloc_pages(
> +                xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)));
> +        ctx->save.deferred_pages = bitmap_alloc(ctx->save.p2m_size);
> +
> +        if ( !dirty_bitmap || !ctx->save.deferred_pages )
> +            goto enomem;
> +    }

So this is better than the previous patch.  At least we've got a clean
NULL pointer now.

I could in principle get on board with the optimisation, except it's not
safe (see below).

> --- a/tools/libs/guest/xg_sr_save.c
> +++ b/tools/libs/guest/xg_sr_save.c
> @@ -130,7 +130,7 @@ static int write_batch(struct xc_sr_cont
>                                                        ctx->save.batch_pfns[i]);
>  
>          /* Likely a ballooned page. */
> -        if ( mfns[i] == INVALID_MFN )
> +        if ( mfns[i] == INVALID_MFN && ctx->save.deferred_pages )
>          {
>              set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
>              ++ctx->save.nr_deferred_pages;
> @@ -196,8 +196,12 @@ static int write_batch(struct xc_sr_cont
>              {
>                  if ( rc == -1 && errno == EAGAIN )
>                  {
> -                    set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
> -                    ++ctx->save.nr_deferred_pages;
> +                    if ( ctx->save.deferred_pages )
> +                    {
> +                        set_bit(ctx->save.batch_pfns[i],
> +                                ctx->save.deferred_pages);
> +                        ++ctx->save.nr_deferred_pages;
> +                    }

These two blocks are the only two which modify deferred_pages.

It occurs to me that this means deferred_pages is PV-only, because of
the stub implementations of x86_hvm_pfn_to_gfn() and
x86_hvm_normalise_page().  Furthermore, this is likely to be true for
any HVM-like domains even on other architectures.

If these instead were hard errors when !deferred_pages, then that would
at least get the logic into an acceptable state.
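
I.e. something like this (sketch only - message wording and error path
per whatever write_batch() already uses):

    /* Likely a ballooned page. */
    if ( mfns[i] == INVALID_MFN )
    {
        if ( !ctx->save.deferred_pages )
        {
            ERROR("pfn %#lx has no MFN, and deferring is not possible",
                  ctx->save.batch_pfns[i]);
            goto err;
        }

        set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
        ++ctx->save.nr_deferred_pages;
    }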

However, the first hunk demonstrates that deferred_pages gets used even
in the non-live case.  In particular, it is sensitive to errors with the
guest's handling of its own P2M.  Also, I can't obviously spot anything
which will correctly fail migration if deferred pages survive the final
iteration.
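
I'd also expect a check along these lines once the final iteration has
completed (sketch; where exactly it wants to live needs thought):

    if ( ctx->save.nr_deferred_pages )
    {
        ERROR("%lu deferred pages remain after the final iteration",
              ctx->save.nr_deferred_pages);
        return -1;
    }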

~Andrew



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 05/12] libxenguest: complete loops in xc_map_domain_meminfo()
  2021-06-25 13:19 ` [PATCH 05/12] libxenguest: complete loops in xc_map_domain_meminfo() Jan Beulich
@ 2021-06-25 18:30   ` Andrew Cooper
  2021-06-28  8:53     ` Jan Beulich
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Cooper @ 2021-06-25 18:30 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, Juergen Gross, George Dunlap, Ian Jackson

On 25/06/2021 14:19, Jan Beulich wrote:
> minfo->p2m_size may have more than 31 significant bits. Change the
> induction variable to unsigned long, and (largely for signed-ness
> consistency) a helper variable to unsigned int.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/tools/libs/guest/xg_domain.c
> +++ b/tools/libs/guest/xg_domain.c
> @@ -40,7 +40,7 @@ int xc_map_domain_meminfo(xc_interface *
>      xc_dominfo_t info;
>      shared_info_any_t *live_shinfo;
>      xen_capabilities_info_t xen_caps = "";
> -    int i;
> +    unsigned long i;
>  
>      /* Only be initialized once */
>      if ( minfo->pfn_type || minfo->p2m_table )
> @@ -116,12 +116,12 @@ int xc_map_domain_meminfo(xc_interface *
>      /* Retrieve PFN types in batches */
>      for ( i = 0; i < minfo->p2m_size ; i+=1024 )
>      {
> -        int count = ((minfo->p2m_size - i ) > 1024 ) ?
> -                        1024: (minfo->p2m_size - i);
> +        unsigned int count = ((minfo->p2m_size - i) > 1024) ?
> +                             1024 : (minfo->p2m_size - i);

min().

Otherwise, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

This whole infrastructure is almost abandoned, and broken.  It's used by
xen-mfndump (debugging only) and xen-hptool mem-offline.

The mem-offline functionality cannot possibly work usefully.  It is PV
only, despite not having an HVM check, and in particular reads the dead
page in an attempt to restore the contents elsewhere.  There is also no
thought given to writes from outside sources, such as DMA from
passthrough or a different dom0 foreign mapping.

This is perhaps ok as an academic demonstration of "can I shuffle memory
behind an alive VM in ideal circumstances", but will be killed by the
dom0 kernel if you ever try running it to resolve a real memory error on
a VM, because there is no possibility of recovering the data.

The mem-offline functionality needs deleting.  It isn't production
ready, and can't credibly be made so.

~Andrew



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/12] libxenguest: guard against overflow from too large p2m when checkpointing
  2021-06-25 13:20 ` [PATCH 06/12] libxenguest: guard against overflow from too large p2m when checkpointing Jan Beulich
@ 2021-06-25 19:00   ` Andrew Cooper
  2021-06-28  9:05     ` Jan Beulich
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Cooper @ 2021-06-25 19:00 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, Juergen Gross, George Dunlap, Ian Jackson

On 25/06/2021 14:20, Jan Beulich wrote:
> struct xc_sr_record's length field has just 32 bits.

The stream max record length is

/* Somewhat arbitrary - 128MB */
#define REC_LENGTH_MAX                (128U << 20)

and is checked in the low level helpers, making the upper bound on the
number of PFNs 0xFFFFFF once the record header is taken into account.
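
(That's (128 << 20) / sizeof(uint64_t) = 1 << 24 PFN entries' worth per
record, before the header is accounted for.)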

There doesn't appear to have been any consideration given to what happens
if this number gets too large.  That said, the replication will totally
fall apart if it ever gets to a fraction of this, because this is the
list of pages the source side needs to send again in addition to
whatever *it* dirtied, as it is the state we've lost on the destination
side by permitting the VM to run live.

The common case is that, when execution diverges, the dirtied pages on
source and destination will be almost the same, so merging this on the
source side shouldn't lead to many superfluous pages needing to be sent.

>  Fill it early and
> check that the calculated value hasn't overflowed. Additionally check
> for counter overflow early - there's no point even trying to allocate
> any memory in such an event.
>
> While there also limit an induction variable's type to unsigned long:
> There's no gain from it being uint64_t.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Of course looping over test_bit() is pretty inefficient, but given that
> I have no idea how to test this code I wanted to restrict changes to
> what can sensibly be seen as no worse than before from just looking at
> the changes.

At this point, I'm not sure it can be tested.  IIRC, COLO depends on
some functionality which didn't make its way upstream into Qemu.

> --- a/tools/libs/guest/xg_sr_restore.c
> +++ b/tools/libs/guest/xg_sr_restore.c
> @@ -450,7 +450,8 @@ static int send_checkpoint_dirty_pfn_lis
>      xc_interface *xch = ctx->xch;
>      int rc = -1;
>      unsigned int count, written;
> -    uint64_t i, *pfns = NULL;
> +    unsigned long i;
> +    uint64_t *pfns = NULL;
>      struct iovec *iov = NULL;
>      struct xc_sr_record rec = {
>          .type = REC_TYPE_CHECKPOINT_DIRTY_PFN_LIST,
> @@ -469,16 +470,28 @@ static int send_checkpoint_dirty_pfn_lis
>  
>      for ( i = 0, count = 0; i < ctx->restore.p2m_size; i++ )
>      {
> -        if ( test_bit(i, dirty_bitmap) )
> -            count++;
> +        if ( test_bit(i, dirty_bitmap) && !++count )

This is far too opaque logic.

It's also entirely unnecessary...  All this loop is doing is calculating
the size for the memory allocation below, and that can be done by using
the stats output from xc_logdirty_control(), which means it doesn't want
deleting in the earlier patch.
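
I.e., keeping the stats local which the earlier patch removes, something
like (sketch):

    count = stats.dirty_count;  /* filled in by the OP_CLEAN call above */

    pfns = malloc(count * sizeof(*pfns));
    if ( !pfns )
        goto err;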

~Andrew



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/12] libxenguest: fix off-by-1 in colo-secondary-bitmap merging
  2021-06-25 13:20 ` [PATCH 07/12] libxenguest: fix off-by-1 in colo-secondary-bitmap merging Jan Beulich
@ 2021-06-25 19:06   ` Andrew Cooper
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Cooper @ 2021-06-25 19:06 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, Juergen Gross, George Dunlap, Ian Jackson

On 25/06/2021 14:20, Jan Beulich wrote:
> Valid GFNs (having a representation in the dirty bitmap) need to be
> strictly below p2m_size.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 08/12] x86/paging: deal with log-dirty stats overflow
  2021-06-25 13:21 ` [PATCH 08/12] x86/paging: deal with log-dirty stats overflow Jan Beulich
@ 2021-06-25 19:09   ` Andrew Cooper
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Cooper @ 2021-06-25 19:09 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Wei Liu, Roger Pau Monné, George Dunlap

On 25/06/2021 14:21, Jan Beulich wrote:
> While the precise values are unlikely of interest once they exceed 4
> billion (allowing us to leave alone the domctl struct), we still
> shouldn't wrap or truncate the actual values. It is in particular
> problematic if the truncated values were zero (causing libxenguest to
> skip an iteration altogether) or a very small value (leading to
> premature exiting of the pre-copy phase).
>
> Change the internal fields to unsigned long, and suitably saturate for
> copying to guest context.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 11/12] x86/mm: pull a sanity check earlier in xenmem_add_to_physmap_one()
  2021-06-25 13:22 ` [PATCH 11/12] x86/mm: pull a sanity check earlier in xenmem_add_to_physmap_one() Jan Beulich
@ 2021-06-25 19:10   ` Andrew Cooper
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Cooper @ 2021-06-25 19:10 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Wei Liu, Roger Pau Monné, George Dunlap

On 25/06/2021 14:22, Jan Beulich wrote:
> We should try to limit the failure reasons after we've started making
> changes.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 12/12] SUPPORT.md: write down restriction of 32-bit tool stacks
  2021-06-25 13:24 ` [PATCH 12/12] SUPPORT.md: write down restriction of 32-bit tool stacks Jan Beulich
@ 2021-06-25 19:45   ` Andrew Cooper
  2021-06-28  9:22     ` Jan Beulich
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Cooper @ 2021-06-25 19:45 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: George Dunlap, Ian Jackson, Julien Grall, Stefano Stabellini, Wei Liu

On 25/06/2021 14:24, Jan Beulich wrote:
> Let's try to avoid giving the impression that 32-bit tool stacks are as
> capable as 64-bit ones.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/SUPPORT.md
> +++ b/SUPPORT.md
> @@ -131,6 +131,11 @@ ARM only has one guest type at the momen
>  
>  ## Toolstack
>  
> +While 32-bit builds of the tool stack are generally supported, restrictions
> +apply in particular when running on top of a 64-bit hypervisor.

Actually, this isn't true, and in a way which helps us right now.

PV32 isn't security supported, and neither ARM nor x86 support dom0
bitness != Xen bitness.  On x86, it doesn't remotely work because of the
pointer size mismatch, and while this was bodged in a horrifying way in
the ARM ABI, I doubt anyone is in a hurry to turn that into a supported
configuration.

That said, it is my intent that with the ABIv2 changes, a 32bit toolstack
under a 64bit guest kernel, under 64bit or 128bit Xen (yes - RISCV-128 is
already a thing being discussed), will function.

>   For example,
> +very large guests aren't supported in this case.

The wording here wants to be careful, because under certain readings,
you've just dropped security support for 32bit toolstacks.

What we actually mean is "a toolstack with bitness < Xen is not expected
to be able to manage very large domains correctly, and don't pester us
with bugs when it doesn't work because we won't fix them".

Whereas we will fix security issues which only happen to manifest in
32bit builds of the toolstack.


>   This includes guests giving
> +the appearance of being large, by altering their own memory layouts.

I'd drop this sentence.  It's an internal detail of a corner case which
we're expecting to remove in the future, and "the guest kernel can DoS
itself" isn't interesting.

~Andrew



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/12] libxenguest: deal with log-dirty op stats overflow
  2021-06-25 16:36   ` Andrew Cooper
@ 2021-06-28  7:48     ` Jan Beulich
  2021-06-28 11:10       ` Olaf Hering
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2021-06-28  7:48 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson, xen-devel

On 25.06.2021 18:36, Andrew Cooper wrote:
> On 25/06/2021 14:18, Jan Beulich wrote:
>> In send_memory_live() the precise value the dirty_count struct field
>> gets initialized to doesn't matter much (apart from the triggering of
>> the log message in send_dirty_pages(), see below), but it is important
>> that it not be zero on the first iteration (or else send_dirty_pages()
>> won't get called at all). Saturate the initializer value at the maximum
>> value the field can hold.
> 
> I don't follow.  Migration would be extremely broken if the first
> iteration didn't work correctly, so something else is going on here.

As per the title we're talking about an overflow situation here. In particular
the field did end up zero when ctx->save.p2m_size was 0x100000000. I'm not
claiming in any way that the first iteration would generally not work.
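
(dirty_count is a uint32_t, so the { 0, ctx->save.p2m_size } initializer
truncated 0x100000000 to exactly 0, meaning send_dirty_pages() never got
invoked.)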

>> While there also initialize struct precopy_stats' respective field to a
>> more sane value: We don't really know how many dirty pages there are at
>> that point.
>>
>> In suspend_and_send_dirty() and verify_frames() the local variables
>> don't need initializing at all, as they're only an output from the
>> hypercall which gets invoked first thing.
>>
>> In send_checkpoint_dirty_pfn_list() the local variable can be dropped
>> altogether: It's optional to xc_logdirty_control() and not used anywhere
>> else.
>>
>> Note that in case the clipping actually takes effect, the "Bitmap
>> contained more entries than expected..." log message will trigger. This
>> being just an informational message, I don't think this is overly
>> concerning.
> 
> That message is currently an error, confirming that the VM will crash on
> the resuming side.

An error? All I see is

    if ( written > entries )
        DPRINTF("Bitmap contained more entries than expected...");

> This is a consequence of it attempting to balloon during the live phase
> of migration, and is discussed in docs/features/migration.pandoc (well -
> at least mentioned on the "no-one has fixed this yet" list).
> 
>> --- a/tools/libs/guest/xg_sr_save.c
>> +++ b/tools/libs/guest/xg_sr_save.c
>> @@ -500,7 +500,9 @@ static int simple_precopy_policy(struct
>>  static int send_memory_live(struct xc_sr_context *ctx)
>>  {
>>      xc_interface *xch = ctx->xch;
>> -    xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
>> +    xc_shadow_op_stats_t stats = {
>> +        .dirty_count = MIN(ctx->save.p2m_size, (typeof(stats.dirty_count))~0)
>> +    };
>>      char *progress_str = NULL;
>>      unsigned int x = 0;
>>      int rc;
>> @@ -519,7 +521,7 @@ static int send_memory_live(struct xc_sr
>>          goto out;
>>  
>>      ctx->save.stats = (struct precopy_stats){
>> -        .dirty_count = ctx->save.p2m_size,
>> +        .dirty_count = -1,
> 
> This is an external interface, and I'm not sure it will tolerate finding
> more than p2m_size allegedly dirty.

But you do realize that a few lines down from here there already was

        policy_stats->dirty_count   = -1;

? Or are you trying to tell me that -1 (documented as indicating
"unknown") is okay on subsequent iterations, but not on the first one?
That's not being said anywhere ...

Jan



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/12] libxenguest: short-circuit "all-dirty" handling
  2021-06-25 17:02   ` Andrew Cooper
@ 2021-06-28  8:26     ` Jan Beulich
  2021-09-02 17:11       ` Ian Jackson
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2021-06-28  8:26 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson, xen-devel

On 25.06.2021 19:02, Andrew Cooper wrote:
> On 25/06/2021 14:18, Jan Beulich wrote:
>> For one it is unnecessary to fill a perhaps large chunk of memory with
>> all ones. Add a new parameter to send_dirty_pages() for callers to
>> indicate so.
>>
>> Then it is further unnecessary to allocate the dirty bitmap altogether
>> when all that's ever going to happen is a single all-dirty run.
> 
> The allocation is deliberate, and does want to stay where it is IMO.
> 
> Single all-dirty runs are a debugging technique only.  All production
> cases are live, and you don't want to fail midway through because a
> late, large, memory allocation failed.

I'm afraid I don't understand: I don't move _when_ the allocation
occurs; I only suppress the allocation (altogether) when the allocated
memory would remain unused.

> As for the send_{dirty,all}_pages() split, that was deliberate to keep
> the logic simple.  The logdirty bitmap is tiny (in comparison to other
> structures) outside of artificial cases like this.
> 
> What you've done with this change has rendered send_all_pages()
> redundant, but not actually taken it out of the code, thereby
> complicating it.  At the moment, this doesn't look to be an improvement.

I view the remaining send_all_pages() as similarly useful (or not) as
e.g. send_domain_memory_checkpointed() (being merely a wrapper around
suspend_and_send_dirty()).

>> @@ -807,8 +798,11 @@ static int setup(struct xc_sr_context *c
>>      if ( rc )
>>          goto err;
>>  
>> -    dirty_bitmap = xc_hypercall_buffer_alloc_pages(
>> -        xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)));
>> +    dirty_bitmap = ctx->save.live || ctx->stream_type != XC_STREAM_PLAIN
>> +        ? xc_hypercall_buffer_alloc_pages(
>> +              xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)))
>> +        : (void *)-1L;
> 
> This is a pointer loaded with a timebomb, which doesn't trigger NULL
> pointer checks, and for which {set,clear}_bit(dirty_bitmap, large_pfn)
> won't fault and will instead corrupt random areas of the address space.

Yeah, this isn't very nice, and gets done away with again in a later
patch. I'd prefer to keep it like it is (assuming the later change
will also go in), but if really deemed necessary I can move that code
re-arrangement here, such that the use of (void *)-1L wouldn't be
necessary anymore. (You may have noticed that all I did this for is
to be able to apply the !dirty_bitmap check later in the function, and that
I deliberately only update the local variable, not the hbuf, making
pretty certain that this pointer isn't going to be de-referenced.)

Jan



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/12] libxenguest: avoid allocating unused deferred-pages bitmap
  2021-06-25 18:08   ` Andrew Cooper
@ 2021-06-28  8:47     ` Jan Beulich
  2021-09-02 17:17       ` Ian Jackson
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2021-06-28  8:47 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson, xen-devel

On 25.06.2021 20:08, Andrew Cooper wrote:
> On 25/06/2021 14:19, Jan Beulich wrote:
>> Like for the dirty bitmap, it is unnecessary to allocate the deferred-
>> pages bitmap when all that's ever going to happen is a single all-dirty
>> run.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> The clearing of the bitmap at the end of suspend_and_send_dirty() also
>> looks unnecessary - am I overlooking anything?
> 
> Yes. Remus and COLO.  You don't want to accumulate successfully-sent
> deferred pages over checkpoints, otherwise you'll eventually be sending
> the entire VM every checkpoint.

Oh, so what I've really missed is save() being a loop over these
functions.
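
(To make sure I now have the picture right - a rough sketch, with
error handling and checkpoint records elided, so not the actual code:)

do {
    if ( ctx->save.live )
        send_memory_live(ctx);     /* iterative pre-copy; may defer pages */
    suspend_and_send_dirty(ctx);   /* sends dirty + deferred pages, then
                                      clears the deferred-pages bitmap */
} while ( ctx->stream_type != XC_STREAM_PLAIN );   /* Remus/COLO loop */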

> Answering out of patch order...
>> @@ -791,24 +797,31 @@ static int setup(struct xc_sr_context *c
>>  {
>>      xc_interface *xch = ctx->xch;
>>      int rc;
>> -    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
>> -                                    &ctx->save.dirty_bitmap_hbuf);
>>  
>>      rc = ctx->save.ops.setup(ctx);
>>      if ( rc )
>>          goto err;
>>  
>> -    dirty_bitmap = ctx->save.live || ctx->stream_type != XC_STREAM_PLAIN
>> -        ? xc_hypercall_buffer_alloc_pages(
>> -              xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)))
>> -        : (void *)-1L;
>> +    if ( ctx->save.live || ctx->stream_type != XC_STREAM_PLAIN )
>> +    {
>> +        DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
>> +                                        &ctx->save.dirty_bitmap_hbuf);
>> +
>> +        dirty_bitmap =
>> +            xc_hypercall_buffer_alloc_pages(
>> +                xch, dirty_bitmap, NRPAGES(bitmap_size(ctx->save.p2m_size)));
>> +        ctx->save.deferred_pages = bitmap_alloc(ctx->save.p2m_size);
>> +
>> +        if ( !dirty_bitmap || !ctx->save.deferred_pages )
>> +            goto enomem;
>> +    }
> 
> So this is better than the previous patch.  At least we've got a clean
> NULL pointer now.
> 
> I could in principle get on board with the optimisation, except it's not
> safe (see below).
> 
>> --- a/tools/libs/guest/xg_sr_save.c
>> +++ b/tools/libs/guest/xg_sr_save.c
>> @@ -130,7 +130,7 @@ static int write_batch(struct xc_sr_cont
>>                                                        ctx->save.batch_pfns[i]);
>>  
>>          /* Likely a ballooned page. */
>> -        if ( mfns[i] == INVALID_MFN )
>> +        if ( mfns[i] == INVALID_MFN && ctx->save.deferred_pages )
>>          {
>>              set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
>>              ++ctx->save.nr_deferred_pages;
>> @@ -196,8 +196,12 @@ static int write_batch(struct xc_sr_cont
>>              {
>>                  if ( rc == -1 && errno == EAGAIN )
>>                  {
>> -                    set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
>> -                    ++ctx->save.nr_deferred_pages;
>> +                    if ( ctx->save.deferred_pages )
>> +                    {
>> +                        set_bit(ctx->save.batch_pfns[i],
>> +                                ctx->save.deferred_pages);
>> +                        ++ctx->save.nr_deferred_pages;
>> +                    }
> 
> These two blocks are the only two which modify deferred_pages.
> 
> It occurs to me that this means deferred_pages is PV-only, because of
> the stub implementations of x86_hvm_pfn_to_gfn() and
> x86_hvm_normalise_page().  Furthermore, this is likely to be true for
> any HVM-like domains even on other architectures.

IOW, are you suggesting that we also avoid the allocation for HVM live
migration, thus effectively making assumptions about the two hooks
being just stubs in that case, which can't ever fail?
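
(For reference, my understanding of the x86 HVM stub in question,
paraphrased from memory and to be double-checked against
xg_sr_save_x86_hvm.c:)

/* For HVM-like domains PFN -> GFN is the identity, so this path can
 * neither fail nor mark anything as deferred. */
static xen_pfn_t x86_hvm_pfn_to_gfn(const struct xc_sr_context *ctx,
                                    xen_pfn_t pfn)
{
    return pfn;
}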

> If these instead were hard errors when !deferred_pages, then that at
> least gets the logic into an acceptable state. 

But the goal here isn't to change the logic, just to avoid allocating
memory that's effectively never used. What you suggest could be a
separate patch, yes, but I'm afraid I'm not feeling confident enough
in understanding why you think this needs changing, so I'd prefer to
leave such a change to you. (If I were to guess at what you mean, I
could deduce that you think ->nr_deferred_pages may
still need maintaining, with it being non-zero at the end of the last
step causing migration to fail. But there would then still not be any
need for the bitmap itself in the cases where it no longer gets
allocated.)

> However, the first hunk demonstrates that deferred_pages gets used even
> in the non-live case.  In particular, it is sensitive to errors with the
> guests' handling of its own P2M.  Also, I can't obviously spot anything
> which will correctly fail migration if deferred pages survive the final
> iteration.

How does the first hunk demonstrate this? The question isn't when
the bitmap gets updated, but under what conditions it gets consumed.
If the only sending function ever called is suspend_and_send_dirty(),
then nothing would ever have had a chance to set any bit. And any
bits set in the course of suspend_and_send_dirty() running will get
cleared again at the end of suspend_and_send_dirty().

Jan




* Re: [PATCH 05/12] libxenguest: complete loops in xc_map_domain_meminfo()
  2021-06-25 18:30   ` Andrew Cooper
@ 2021-06-28  8:53     ` Jan Beulich
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Beulich @ 2021-06-28  8:53 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson, xen-devel

On 25.06.2021 20:30, Andrew Cooper wrote:
> On 25/06/2021 14:19, Jan Beulich wrote:
>> minfo->p2m_size may have more than 31 significant bits. Change the
>> induction variable to unsigned long, and (largely for signed-ness
>> consistency) a helper variable to unsigned int.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> --- a/tools/libs/guest/xg_domain.c
>> +++ b/tools/libs/guest/xg_domain.c
>> @@ -40,7 +40,7 @@ int xc_map_domain_meminfo(xc_interface *
>>      xc_dominfo_t info;
>>      shared_info_any_t *live_shinfo;
>>      xen_capabilities_info_t xen_caps = "";
>> -    int i;
>> +    unsigned long i;
>>  
>>      /* Only be initialized once */
>>      if ( minfo->pfn_type || minfo->p2m_table )
>> @@ -116,12 +116,12 @@ int xc_map_domain_meminfo(xc_interface *
>>      /* Retrieve PFN types in batches */
>>      for ( i = 0; i < minfo->p2m_size ; i+=1024 )
>>      {
>> -        int count = ((minfo->p2m_size - i ) > 1024 ) ?
>> -                        1024: (minfo->p2m_size - i);
>> +        unsigned int count = ((minfo->p2m_size - i) > 1024) ?
>> +                             1024 : (minfo->p2m_size - i);
> 
> min().

min() using 1024UL or MIN()? (I'll use the former unless you tell
me otherwise.)
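
I.e., assuming the tools' type-checking min() is usable in this file,
simply:

    unsigned int count = min(minfo->p2m_size - i, 1024UL);

(the UL suffix being needed to satisfy min()'s matching-type check,
since minfo->p2m_size - i is unsigned long).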

> Otherwise, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks.

> This whole infrastructure is almost abandoned, and broken.  Its used by
> xen-mfndump (debugging only) and xen-hptool mem-offline.
> 
> The mem-offline functionally cannot possibly work usefully.  It is PV
> only, despite not having an HVM check, and in particular reads the dead
> page in an attempt to restore the contents elsewhere.  There is also no
> thought given to writes from outside sources, such as DMA from
> passthrough or a different dom0 foreign mapping.
> 
> This is perhaps ok as an academic demonstration of "can I shuffle memory
> behind an alive VM in ideal circumstances", but will be killed by the
> dom0 kernel if you ever try running it to resolve a real memory error on
> a VM, because there is no possibility of recovering the data.
> 
> The mem-offline functionality needs deleting.  It isn't production
> ready, and can't credibly be made so.

I definitely agree; I'm merely trying to address an anomaly found
while auditing the code for certain properties, without any claim
that afterwards any of this would really work.

Jan




* Re: [PATCH 06/12] libxenguest: guard against overflow from too large p2m when checkpointing
  2021-06-25 19:00   ` Andrew Cooper
@ 2021-06-28  9:05     ` Jan Beulich
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Beulich @ 2021-06-28  9:05 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson, xen-devel

On 25.06.2021 21:00, Andrew Cooper wrote:
> On 25/06/2021 14:20, Jan Beulich wrote:
>> struct xc_sr_record's length field has just 32 bits.
> 
> The stream max record length is
> 
> /* Somewhat arbitrary - 128MB */
> #define REC_LENGTH_MAX                (128U << 20)
> 
> and is checked in the low level helpers, making the upper bound on the
> number of PFNs 0xFFFFFF once the record header is taken into account.
> 
> There doesn't appear to have been any consideration made to what happens
> if this number gets too large.  That said, the replication will totally
> fall apart if it ever gets to a fraction of this, because this is the
> list of pages the source side needs to send again in addition to
> whatever *it* dirtied, as it is the state we've lost on the destination
> side by permitting the VM to run live.
> 
> The common case is that, when execution diverges, the dirtied pages on
> source and destination will be almost the same, so merging this on the
> source side shouldn't lead to many superfluous pages needing to be sent.

While I follow what you say, I can't conclude whether you mean me to
make any change to this first sentence of the description (or even to
the patch itself).

>> --- a/tools/libs/guest/xg_sr_restore.c
>> +++ b/tools/libs/guest/xg_sr_restore.c
>> @@ -450,7 +450,8 @@ static int send_checkpoint_dirty_pfn_lis
>>      xc_interface *xch = ctx->xch;
>>      int rc = -1;
>>      unsigned int count, written;
>> -    uint64_t i, *pfns = NULL;
>> +    unsigned long i;
>> +    uint64_t *pfns = NULL;
>>      struct iovec *iov = NULL;
>>      struct xc_sr_record rec = {
>>          .type = REC_TYPE_CHECKPOINT_DIRTY_PFN_LIST,
>> @@ -469,16 +470,28 @@ static int send_checkpoint_dirty_pfn_lis
>>  
>>      for ( i = 0, count = 0; i < ctx->restore.p2m_size; i++ )
>>      {
>> -        if ( test_bit(i, dirty_bitmap) )
>> -            count++;
>> +        if ( test_bit(i, dirty_bitmap) && !++count )
> 
> This is far too opaque logic.

This is an observation without any hint towards a possible improvement.
I can attach a comment, but it's far from clear whether that's all
you're after.
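
(Concretely, the comment I'd attach - shown with an abbreviated error
path, the exact message being secondary:)

if ( test_bit(i, dirty_bitmap) && !++count )
{
    /*
     * The increment just wrapped "count" (an unsigned int) from
     * UINT_MAX back to 0, i.e. there are more dirty pages than the
     * 32-bit record length can express - fail rather than truncate.
     */
    ERROR("Too many dirty pages for one checkpoint record");
    goto err;
}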

> Its also entirely unnecessary...  All this loop is doing is calculating
> the size for the memory allocation below, and that can be done by using
> the stats output from xc_logdirty_control(), which means it doesn't want
> deleting in the earlier patch.

Only if there is a guarantee that the stats.dirty_count value hasn't
itself been truncated (or, as per the later patch, saturated) in the
hypervisor. Otherwise there may be more set bits in the bitmap, and
counting locally is still necessary. I generally think that anything
called "stats" may be used as guiding information, but not as precise
data. Yet the latter would be needed if you want to make memory
allocation depend on it.

Jan




* Re: [PATCH 12/12] SUPPORT.md: write down restriction of 32-bit tool stacks
  2021-06-25 19:45   ` Andrew Cooper
@ 2021-06-28  9:22     ` Jan Beulich
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Beulich @ 2021-06-28  9:22 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: George Dunlap, Ian Jackson, Julien Grall, Stefano Stabellini,
	Wei Liu, xen-devel

On 25.06.2021 21:45, Andrew Cooper wrote:
> On 25/06/2021 14:24, Jan Beulich wrote:
>> Let's try to avoid giving the impression that 32-bit tool stacks are as
>> capable as 64-bit ones.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> --- a/SUPPORT.md
>> +++ b/SUPPORT.md
>> @@ -131,6 +131,11 @@ ARM only has one guest type at the momen
>>  
>>  ## Toolstack
>>  
>> +While 32-bit builds of the tool stack are generally supported, restrictions
>> +apply in particular when running on top of a 64-bit hypervisor.
> 
> Actually, this isn't true, and in a way which helps us right now.
> 
> PV32 isn't security supported, and neither ARM nor x86 support dom0
> bitness != Xen bitness.

While I agree this may be one possible reading of the recent change of
support status for PV32, I didn't so far think "x86 doesn't support
dom0 bitness != Xen bitness" was stated anywhere. The recent change
was about security support only, yet Dom0 isn't covered by this as
far as its status as a "guest" goes. This view of mine is, I think,
supported by osstest actually spending a fair share of its effort
on testing exactly this.

Also please don't forget that besides a (64,32,32) tuple of (Xen,
Dom0-kernel,Dom0-userspace) there's also the possible (64,64,32)
one.

>  On x86, it doesn't remotely work because of the
> pointer size mismatch,

Are you saying this for (64,32,32) or (64,64,32)? In any event I
have to admit that I don't see where pointer size (and it not
matching) would make this in principle impossible. That's what
both Xen and kernel have compat layers for.
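
(A hypothetical illustration - deliberately not a real Xen structure -
of the kind of thing those compat layers exist to translate:)

/*
 * The same declaration lays out differently in 32- and 64-bit builds:
 * the pointer field is 4 vs. 8 bytes wide, shifting the offset (and
 * alignment) of everything after it.  The compat code converts such
 * structures at the interface boundary.
 */
struct example_op {
    uint32_t domid;
    uint8_t *buffer;
    uint64_t size;
};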

> and while this was bodged in a horrifying way in
> the ARM ABI, I doubt anyone is in a hurry to turn that into a supported
> configuration.
> 
> That said, it is my intent with the ABIv2 changes for a 32bit toolstack,
> under 64bit guest kernel, under 64bit or 128bit Xen (yes - RISCV-128 is
> already a thing being discussed) to function.

I'm curious what your plans are there. Afaict accommodating 128-
bit in the ABI right away would be a good idea, but may end up bloating
structures unnecessarily. But perhaps you have some clever ideas
there ...

>>   For example,
>> +very large guests aren't supported in this case.
> 
> The wording here wants to be careful, because under certain readings,
> you've just dropped security support for 32bit toolstacks.
> 
> What we actually mean is "a toolstack with bitness < Xen is not expected
> to be able to manage very large domains correctly, and don't pester us
> with bugs when it doesn't work because we won't fix them".
> 
> Whereas we will fix security issues which only happen to manifest in
> 32bit builds of the toolstack.

I've replaced "supported" by "expected to be manageable". If this
still doesn't fit, then please provide sufficient detail for me
to derive what wording would suit you.

>>   This includes guests giving
>> +the appearance of being large, by altering their own memory layouts.
> 
> I'd drop sentence.  Its an internal detail of a corner case which we're
> expecting to remove in the future,

But this is the main reason for us having noticed the lack of a clear
statement here. Plus within the current ABI I don't see us having
any means to remove all the truncation issues. We shall be glad if
we get a 64-bit tool stack to correctly deal with such guests
(performance aside).

> and "the guest kernel can DoS itself" isn't interesting.

Of course there's no intention to state anything about this aspect.

Jan




* Re: [PATCH 01/12] libxc: split xc_logdirty_control() from xc_shadow_control()
  2021-06-25 15:49   ` Andrew Cooper
@ 2021-06-28  9:40     ` Jan Beulich
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Beulich @ 2021-06-28  9:40 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson, Marek Marczykowski,
	Christian Lindig, David Scott, xen-devel

On 25.06.2021 17:49, Andrew Cooper wrote:
> On 25/06/2021 14:17, Jan Beulich wrote:
>> For log-dirty operations a 64-bit field is being truncated to become an
>> "int" return value. Seeing the large number of arguments the present
>> function takes, reduce its set of parameters to that needed for all
>> operations not involving the log-dirty bitmap, while introducing a new
>> wrapper for the log-dirty bitmap operations. This new function in turn
>> doesn't need an "mb" parameter, but has a 64-bit return type. (Using the
>> return value in favor of a pointer-type parameter is left as is, to
>> disturb callers as little as possible.)
>>
>> While altering xc_shadow_control() anyway, also adjust the types of the
>> last two of the remaining parameters.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I wonder whether we shouldn't take the opportunity and also rename
>> xc_shadow_control() to, say, xc_paging_control(), matching the layer
>> above the HAP/shadow distinction in the hypervisor.
> 
> I do remember this being an especially obnoxious interface to use.  Any
> improvement would go a long way, but I think we also need to rename some
> domctls too.

Perhaps, but not here and now. Yet still - do you have an opinion
on renaming xc_shadow_control() on this occasion? I will admit
that I don't consider xc_paging_control() a great name, but at
least it would get terminology in line with the hypervisor's.

>> --- a/tools/include/xenctrl.h
>> +++ b/tools/include/xenctrl.h
>> @@ -885,11 +885,15 @@ typedef struct xen_domctl_shadow_op_stat
>>  int xc_shadow_control(xc_interface *xch,
>>                        uint32_t domid,
>>                        unsigned int sop,
>> -                      xc_hypercall_buffer_t *dirty_bitmap,
>> -                      unsigned long pages,
>> -                      unsigned long *mb,
>> -                      uint32_t mode,
>> -                      xc_shadow_op_stats_t *stats);
>> +                      unsigned int *mb,
>> +                      unsigned int mode);
>> +long long xc_logdirty_control(xc_interface *xch,
> 
> uint64_t to match the hypercall?  All users of libxc are stdint.h aware.

First of all - if anything, then int64_t. And I first wanted to use
that type, indeed. But then I went and checked: There are no present
uses of int64_t at all here; throughout tools/include/ there's exactly
one: A function parameter in libxl.h. Otoh there is precedent for a
function returning "long long". Hence I went with that.

>> --- a/tools/libs/ctrl/xc_domain.c
>> +++ b/tools/libs/ctrl/xc_domain.c
>> @@ -650,25 +650,48 @@ int xc_watchdog(xc_interface *xch,
>>  int xc_shadow_control(xc_interface *xch,
>>                        uint32_t domid,
>>                        unsigned int sop,
>> -                      xc_hypercall_buffer_t *dirty_bitmap,
>> -                      unsigned long pages,
>> -                      unsigned long *mb,
>> -                      uint32_t mode,
>> -                      xc_shadow_op_stats_t *stats)
>> +                      unsigned int *mb,
>> +                      unsigned int mode)
>>  {
>>      int rc;
>>      DECLARE_DOMCTL;
>> -    DECLARE_HYPERCALL_BUFFER_ARGUMENT(dirty_bitmap);
>>  
>>      memset(&domctl, 0, sizeof(domctl));
>>  
>>      domctl.cmd = XEN_DOMCTL_shadow_op;
>>      domctl.domain = domid;
>>      domctl.u.shadow_op.op     = sop;
>> -    domctl.u.shadow_op.pages  = pages;
>>      domctl.u.shadow_op.mb     = mb ? *mb : 0;
>>      domctl.u.shadow_op.mode   = mode;
>> -    if (dirty_bitmap != NULL)
>> +
>> +    rc = do_domctl(xch, &domctl);
>> +
>> +    if ( mb )
>> +        *mb = domctl.u.shadow_op.mb;
>> +
>> +    return rc;
>> +}
>> +
>> +long long xc_logdirty_control(xc_interface *xch,
>> +                              uint32_t domid,
>> +                              unsigned int sop,
>> +                              xc_hypercall_buffer_t *dirty_bitmap,
>> +                              unsigned long pages,
>> +                              unsigned int mode,
>> +                              xc_shadow_op_stats_t *stats)
>> +{
>> +    int rc;
>> +    DECLARE_DOMCTL;
>> +    DECLARE_HYPERCALL_BUFFER_ARGUMENT(dirty_bitmap);
>> +
>> +    memset(&domctl, 0, sizeof(domctl));
>> +
>> +    domctl.cmd = XEN_DOMCTL_shadow_op;
>> +    domctl.domain = domid;
>> +    domctl.u.shadow_op.op    = sop;
>> +    domctl.u.shadow_op.pages = pages;
>> +    domctl.u.shadow_op.mode  = mode;
> 
> Please use:
> 
> struct xen_domctl domctl = {
>     .cmd = XEN_DOMCTL_shadow_op,
>     ...
> };
> 
> I've been slowly taking out users of DECLARE_DOMCTL, because beyond
> being pure code obfuscation, valgrind (rightly) complains that the
> hypercall operates on uninitialised memory.

Well, yes, if that's the intended new style, then I can surely do it
that way. Would get the two functions out of sync though, unless I
also made the (unrelated) change in the original place as well.
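
(For concreteness, I understand the requested style to mean the new
function would start roughly like this:)

    struct xen_domctl domctl = {
        .cmd = XEN_DOMCTL_shadow_op,
        .domain = domid,
        .u.shadow_op.op    = sop,
        .u.shadow_op.pages = pages,
        .u.shadow_op.mode  = mode,
    };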

(For the record, I don't think valgrind rightly complains: As with
pointer arguments passed to functions, at the call site it is unknown
whether the pointed-to object is input, output, or both. Analysis
tools can at best make guesses there.)

>> --- a/tools/libs/light/libxl_x86.c
>> +++ b/tools/libs/light/libxl_x86.c
>> @@ -529,10 +529,10 @@ int libxl__arch_domain_create(libxl__gc
>>          xc_domain_set_time_offset(ctx->xch, domid, rtc_timeoffset);
>>  
>>      if (d_config->b_info.type != LIBXL_DOMAIN_TYPE_PV) {
>> -        unsigned long shadow = DIV_ROUNDUP(d_config->b_info.shadow_memkb,
>> -                                           1024);
>> +        unsigned int shadow = DIV_ROUNDUP(d_config->b_info.shadow_memkb,
>> +                                          1024);
>>          xc_shadow_control(ctx->xch, domid, XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION,
>> -                          NULL, 0, &shadow, 0, NULL);
>> +                          &shadow, 0);
> 
> I know this isn't introduced by your patch, but this cannot possibly be
> correct without error handling.  There is a good chance of this call
> running Xen out of memory.
> 
> Any chance of a fix split out into a separate patch?

Sure - overly mechanical editing on my part; I clearly could have
spotted this myself. At the risk of stating the obvious though: This
change will come with a risk of regressions, in case a request was
almost successful and the guest then gets away with the slightly
smaller allocation.
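
(For concreteness, I'd expect that separate patch to amount to
something like the following - the logging macro and error value are
approximations, to be checked against libxl conventions:)

int ret = xc_shadow_control(ctx->xch, domid,
                            XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION,
                            &shadow, 0);
if (ret < 0) {
    /* LOGED()/ERROR_FAIL are my guesses at the local conventions. */
    LOGED(ERROR, domid, "setting %u MiB shadow allocation failed", shadow);
    return ERROR_FAIL;
}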

>> --- a/tools/ocaml/libs/xc/xenctrl_stubs.c
>> +++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
>> @@ -997,13 +997,13 @@ CAMLprim value stub_shadow_allocation_ge
>>  {
>>  	CAMLparam2(xch, domid);
>>  	CAMLlocal1(mb);
>> -	unsigned long c_mb;
>> +	unsigned int c_mb;
>>  	int ret;
>>  
>>  	caml_enter_blocking_section();
>>  	ret = xc_shadow_control(_H(xch), _D(domid),
>>  				XEN_DOMCTL_SHADOW_OP_GET_ALLOCATION,
>> -				NULL, 0, &c_mb, 0, NULL);
>> +				&c_mb, 0);
>>  	caml_leave_blocking_section();
>>  	if (ret != 0)
>>  		failwith_xc(_H(xch));
> 
> Not a bug introduced in this patch, but this is broken.  There is a kb
> vs mb units mismatch, and I don't see any shifts by 10 anywhere in the
> Ocaml stubs.

May I please get away with leaving this to the maintainers? I already
don't feel overly comfortable touching Ocaml code...

>> @@ -1016,14 +1016,14 @@ CAMLprim value stub_shadow_allocation_se
>>  					  value mb)
>>  {
>>  	CAMLparam3(xch, domid, mb);
>> -	unsigned long c_mb;
>> +	unsigned int c_mb;
>>  	int ret;
>>  
>>  	c_mb = Int_val(mb);
> 
> This has a 31 bit truncation issue on 32bit builds.  I'm not sure how
> much we care.
> 
>>  	caml_enter_blocking_section();
>>  	ret = xc_shadow_control(_H(xch), _D(domid),
>>  				XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION,
>> -				NULL, 0, &c_mb, 0, NULL);
>> +				&c_mb, 0);
>>  	caml_leave_blocking_section();
>>  	if (ret != 0)
>>  		failwith_xc(_H(xch));
>> --- a/tools/python/xen/lowlevel/xc/xc.c
>> +++ b/tools/python/xen/lowlevel/xc/xc.c
>> @@ -1192,8 +1192,7 @@ static PyObject *pyxc_shadow_control(PyO
>>                                        &dom, &op) )
>>          return NULL;
>>      
>> -    if ( xc_shadow_control(xc->xc_handle, dom, op, NULL, 0, NULL, 0, NULL) 
>> -         < 0 )
>> +    if ( xc_shadow_control(xc->xc_handle, dom, op, NULL, 0) < 0 )
>>          return pyxc_error_to_exception(xc->xc_handle);
>>      
>>      Py_INCREF(zero);
>> @@ -1208,7 +1207,7 @@ static PyObject *pyxc_shadow_mem_control
>>      int op;
>>      uint32_t dom;
>>      int mbarg = -1;
>> -    unsigned long mb;
>> +    unsigned int mb;
>>  
>>      static char *kwd_list[] = { "dom", "mb", NULL };
>>  
>> @@ -1223,7 +1222,7 @@ static PyObject *pyxc_shadow_mem_control
>>          mb = mbarg;
>>          op = XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION;
>>      }
>> -    if ( xc_shadow_control(xc->xc_handle, dom, op, NULL, 0, &mb, 0, NULL) < 0 )
>> +    if ( xc_shadow_control(xc->xc_handle, dom, op, &mb, 0) < 0 )
>>          return pyxc_error_to_exception(xc->xc_handle);
> 
> Here too.  There are int truncations on the input and output, and like
> the Ocaml stubs, an apparent kb vs mb confusion.
> 
> I'm not sure whether switching to PyLong is sensible.  Its probably ok
> from a compatibility perspective.

Same here then, as I don't think I'm making anything worse with
these purely mechanical changes.

Jan




* Re: [PATCH 02/12] libxenguest: deal with log-dirty op stats overflow
  2021-06-28  7:48     ` Jan Beulich
@ 2021-06-28 11:10       ` Olaf Hering
  2021-06-28 11:20         ` Jan Beulich
  0 siblings, 1 reply; 36+ messages in thread
From: Olaf Hering @ 2021-06-28 11:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson, xen-devel

On Mon, 28 Jun 2021 09:48:26 +0200
Jan Beulich <jbeulich@suse.com> wrote:

> On 25.06.2021 18:36, Andrew Cooper wrote:
> > This is an external interface, and I'm not sure it will tolerate finding
> > more than p2m_size allegedly dirty.  
> But you do realize that a few lines down from here there already was
>         policy_stats->dirty_count   = -1;
> ? Or are you trying to tell me that -1 (documented as indicating
> "unknown") is okay on subsequent iterations, but not on the first one?

precopy_policy() gets called twice during each iteration.
Last time I tried to use this API it was difficult to work with.
One has to look at both dirty_count and iteration to see the actual state.
Maybe it was just me who initially failed to fully understand the intent.

I think as it is right now, the first run with iteration being zero is
the only way to know the actual p2m_size, in case the consumer really
wants to know this detail.
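
For illustration, a consumer currently has to do something like this
(a sketch only; the types are per my reading of xenguest.h, while
record_p2m_size() and THRESHOLD are made up for the example):

static int my_policy(struct precopy_stats stats, void *user)
{
    /* Iteration 0 is the all-dirty pass, so dirty_count should equal
     * p2m_size there - unless it reads as -1, i.e. "unknown". */
    if ( stats.iteration == 0 && stats.dirty_count >= 0 )
        record_p2m_size(user, stats.dirty_count);

    if ( stats.dirty_count >= 0 && stats.dirty_count < THRESHOLD )
        return XGS_POLICY_STOP_AND_COPY;

    return XGS_POLICY_CONTINUE_PRECOPY;
}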

Olaf


* Re: [PATCH 02/12] libxenguest: deal with log-dirty op stats overflow
  2021-06-28 11:10       ` Olaf Hering
@ 2021-06-28 11:20         ` Jan Beulich
  2021-06-28 11:30           ` Olaf Hering
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2021-06-28 11:20 UTC (permalink / raw)
  To: Olaf Hering
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson, xen-devel

On 28.06.2021 13:10, Olaf Hering wrote:
> Am Mon, 28 Jun 2021 09:48:26 +0200
> schrieb Jan Beulich <jbeulich@suse.com>:
> 
>> On 25.06.2021 18:36, Andrew Cooper wrote:
>>> This is an external interface, and I'm not sure it will tolerate finding
>>> more than p2m_size allegedly dirty.  
>> But you do realize that a few lines down from here there already was
>>         policy_stats->dirty_count   = -1;
>> ? Or are you trying to tell me that -1 (documented as indicating
>> "unknown") is okay on subsequent iterations, but not on the first one?
> 
> precopy_policy() gets called twice during each iteration.
> Last time I tried to use this API it was difficult to work with.
> It is required to look at dirty_count and iteration to see the actual state.
> Maybe it was just me who initially failed to fully understand the intent.
> 
> I think as it is right now, the first run with iteration being zero is
> the only way to know the actual p2m_size, in case the consumer really
> wants to know this detail.

But if a field named dirty_count was intended to convey the P2M size
(and then only on the first iteration), this very certainly would
have needed writing down somewhere.

Jan




* Re: [PATCH 02/12] libxenguest: deal with log-dirty op stats overflow
  2021-06-28 11:20         ` Jan Beulich
@ 2021-06-28 11:30           ` Olaf Hering
  0 siblings, 0 replies; 36+ messages in thread
From: Olaf Hering @ 2021-06-28 11:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, Ian Jackson, xen-devel

On Mon, 28 Jun 2021 13:20:06 +0200
Jan Beulich <jbeulich@suse.com> wrote:

> But if a field named dirty_count was intended to convey the P2M size
> (and then only on the first iteration), this very certainly would
> have needed writing down somewhere.

Sure. Right now the only documentation for the precopy_policy
callback is send_memory_live itself.
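
Something along these lines next to the callback's declaration would
already help (a draft derived purely from this thread, to be verified
against send_memory_live() before anyone commits it):

/*
 * precopy_policy() is called twice per iteration of the live loop.
 *  - On iteration 0 (the all-dirty pass) dirty_count equals p2m_size,
 *    or is -1 if unknown.
 *  - On later iterations dirty_count is the number of pages about to
 *    be sent, again -1 if unknown.
 *  - total_written accumulates the pages written to the stream so far.
 */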


Olaf


* Re: [PATCH 03/12] libxenguest: short-circuit "all-dirty" handling
  2021-06-28  8:26     ` Jan Beulich
@ 2021-09-02 17:11       ` Ian Jackson
  0 siblings, 0 replies; 36+ messages in thread
From: Ian Jackson @ 2021-09-02 17:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, xen-devel

Jan Beulich writes ("Re: [PATCH 03/12] libxenguest: short-circuit "all-dirty" handling"):
> On 25.06.2021 19:02, Andrew Cooper wrote:
> > Single all-dirty runs are a debugging technique only.  All production
> > cases are live, and you don't want to fail midway through because a
> > late, large, memory allocation failed.
> 
> I'm afraid I don't understand: I don't move _when_ the allocation
> occurs; I only suppress the allocation (altogether) when the allocated
> memory remains unused.

Thanks for this reply, Jan.  I hope it is satisfactory to Andrew; if
not, I hope Andrew will be able to explain.

I am going to give this a provisional

Acked-by: Ian Jackson <iwj@xenproject.org>

I may of course withdraw that ack if it is explained to me that this
patch is wrong.

> >> +        : (void *)-1L;
> > 
> > This is a pointer loaded with a timebomb, which doesn't trigger NULL
> > pointer checks, and for which {set,clear}_bit(dirty_bitmap, large_pfn)
> > won't fault and will instead corrupt random areas of the address space.
> 
> Yeah, this isn't very nice, and gets done away with again in a later
> patch. I'd prefer to keep it like it is (assuming the later change
> will also go in), but if really deemed necessary I can move that code
> re-arrangement here, such that the use of (void *)-1L wouldn't be
> necessary anymore. (You may have noticed that all I did this for is
> to be able to pass the !dirty_bitmap later in the function, and that
> I deliberately only update the local variable, not the hbuf, making
> pretty certain that this pointer isn't going to be de-referenced.)

I agree that this -1L is very unpleasant.

I'm not going to say that you need to restructure your series, but
please make sure you don't commit this patch without the fix.

Thanks,
Ian.



* Re: [PATCH 04/12] libxenguest: avoid allocating unused deferred-pages bitmap
  2021-06-28  8:47     ` Jan Beulich
@ 2021-09-02 17:17       ` Ian Jackson
  0 siblings, 0 replies; 36+ messages in thread
From: Ian Jackson @ 2021-09-02 17:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Juergen Gross, George Dunlap, xen-devel

Jan Beulich writes ("Re: [PATCH 04/12] libxenguest: avoid allocating unused deferred-pages bitmap"):
> [stuff]

I have read this conversation several times and it is not clear to me
whether Andrew was saying Jan's patch is bad, or the existing code is
bad.

I'm hesitant to give an ack for an optimisation without understanding
what the implications might be.  Andrew, can you explain what will go
wrong if we take Jan's patch ?

(I am not really familiar with this area of the code.  If necessary I
could go and read it to form my own opinion.)

Thanks,
Ian.



Thread overview: 36+ messages
2021-06-25 13:15 [PATCH 00/12] x86: more or less log-dirty related improvements Jan Beulich
2021-06-25 13:17 ` [PATCH 01/12] libxc: split xc_logdirty_control() from xc_shadow_control() Jan Beulich
2021-06-25 14:51   ` Christian Lindig
2021-06-25 15:49   ` Andrew Cooper
2021-06-28  9:40     ` Jan Beulich
2021-06-25 13:18 ` [PATCH 02/12] libxenguest: deal with log-dirty op stats overflow Jan Beulich
2021-06-25 16:36   ` Andrew Cooper
2021-06-28  7:48     ` Jan Beulich
2021-06-28 11:10       ` Olaf Hering
2021-06-28 11:20         ` Jan Beulich
2021-06-28 11:30           ` Olaf Hering
2021-06-25 13:18 ` [PATCH 03/12] libxenguest: short-circuit "all-dirty" handling Jan Beulich
2021-06-25 17:02   ` Andrew Cooper
2021-06-28  8:26     ` Jan Beulich
2021-09-02 17:11       ` Ian Jackson
2021-06-25 13:19 ` [PATCH 04/12] libxenguest: avoid allocating unused deferred-pages bitmap Jan Beulich
2021-06-25 18:08   ` Andrew Cooper
2021-06-28  8:47     ` Jan Beulich
2021-09-02 17:17       ` Ian Jackson
2021-06-25 13:19 ` [PATCH 05/12] libxenguest: complete loops in xc_map_domain_meminfo() Jan Beulich
2021-06-25 18:30   ` Andrew Cooper
2021-06-28  8:53     ` Jan Beulich
2021-06-25 13:20 ` [PATCH 06/12] libxenguest: guard against overflow from too large p2m when checkpointing Jan Beulich
2021-06-25 19:00   ` Andrew Cooper
2021-06-28  9:05     ` Jan Beulich
2021-06-25 13:20 ` [PATCH 07/12] libxenguest: fix off-by-1 in colo-secondary-bitmap merging Jan Beulich
2021-06-25 19:06   ` Andrew Cooper
2021-06-25 13:21 ` [PATCH 08/12] x86/paging: deal with log-dirty stats overflow Jan Beulich
2021-06-25 19:09   ` Andrew Cooper
2021-06-25 13:21 ` [PATCH 09/12] x86/paging: supply more useful log-dirty page count Jan Beulich
2021-06-25 13:22 ` [PATCH 10/12] x86/mm: update log-dirty bitmap when manipulating P2M Jan Beulich
2021-06-25 13:22 ` [PATCH 11/12] x86/mm: pull a sanity check earlier in xenmem_add_to_physmap_one() Jan Beulich
2021-06-25 19:10   ` Andrew Cooper
2021-06-25 13:24 ` [PATCH 12/12] SUPPORT.md: write down restriction of 32-bit tool stacks Jan Beulich
2021-06-25 19:45   ` Andrew Cooper
2021-06-28  9:22     ` Jan Beulich
