* [BUG] repeated live migration for VM failed
From: Hao, Xudong @ 2017-05-22  6:35 UTC (permalink / raw)
  To: xen-devel; +Cc: Lars Kurth, Julien Grall, George Dunlap, Gao, Chao



Bug detailed description:
----------------
Create one RHEL 7.3 HVM guest and live-migrate it repeatedly. After 200+ or 300+ live migrations, the toolstack reports an error and the migration fails.

Environment:
----------------
HW: Skylake server
Xen: Xen 4.9.0 RC4
Dom0: Linux 4.11.0

Reproduce steps:
----------------
1. Compile Xen 4.9 RC4 and dom0 kernel 4.11.0, then boot into dom0.
2. Boot a RHEL 7.3 HVM guest.
3. Migrate the guest to localhost, then sleep 10 seconds.
4. Repeat step 3.
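
Steps 3 and 4 could be automated with a small driver along the following lines. This is only a sketch under stated assumptions: every name starting with `sketch_` is hypothetical, the guest name and iteration count are parameters, and the only command taken from this report is `xl migrate <guest> localhost`. The dry-run runner exists so the loop can be tried without a Xen host.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>   /* system(), one possible runner on a real host */
#include <unistd.h>   /* sleep() */

/* A runner executes one shell command and returns 0 on success. */
typedef int (*sketch_run_fn)(const char *cmd);

/* Dry-run state (hypothetical): lets the loop be tried without a Xen host. */
int sketch_fail_at = 0;  /* call number that should fail; 0 = never fail */
int sketch_calls = 0;    /* number of commands "run" so far */

/* Dry-run runner: counts calls and fails only on call number sketch_fail_at. */
int sketch_dry_run(const char *cmd)
{
    (void)cmd;
    sketch_calls++;
    return (sketch_fail_at != 0 && sketch_calls == sketch_fail_at) ? 1 : 0;
}

/*
 * Migrate `guest` to localhost up to `iterations` times, pausing
 * `pause_s` seconds after each success. Returns 0 if every migration
 * succeeded, the 1-based number of the first failing iteration
 * otherwise, or -1 if the command does not fit in the buffer.
 */
int sketch_repeat_migration(const char *guest, int iterations,
                            unsigned int pause_s, sketch_run_fn run)
{
    char cmd[256];
    int i, n;

    assert(guest != NULL && run != NULL);

    n = snprintf(cmd, sizeof(cmd), "xl migrate %s localhost", guest);
    if (n < 0 || (size_t)n >= sizeof(cmd))
        return -1;

    for (i = 1; i <= iterations; i++) {
        if (run(cmd) != 0)
            return i;       /* report which iteration failed */
        sleep(pause_s);
    }
    return 0;
}
```

On a real host one might call `sketch_repeat_migration("24hrs_lm_guest_2", 500, 10, system)`, assuming ssh to localhost does not need an interactive password. A nonzero return value then gives the iteration at which the migration first failed.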



Current result:
----------------
VM migration fails.

Base error log:
----------------
xl migrate 24hrs_lm_guest_2 localhost
root@localhost's password:
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x3/0x0/1761)
Loading new save file <incoming migration stream> (new xl fmt info 0x3/0x0/1761)
Savefile contains xl domain config in JSON format
Parsing config from <saved>
xc: info: Saving domain 273, type x86 HVM
xc: info: Found x86 HVM domain from Xen 4.9
xc: info: Restoring domain
xc: error: set HVM param 12 = 0x00000000feffe000 (85 = Interrupted system call should ): Internal error
xc: error: Restore failed (85 = Interrupted system call should ): Internal error
libxl: error: libxl_stream_read.c:852:libxl__xc_domain_restore_done: restoring domain: Interrupted system call should be restarted
libxl: error: libxl_create.c:1217:domcreate_rebuild_done: Domain 274:cannot (re-)build domain: -3
libxl: error: libxl_domain.c:1003:libxl__destroy_domid: Domain 274:Non-existant domain
libxl: error: libxl_domain.c:962:domain_destroy_callback: Domain 274:Unable to destroy guest
libxl: error: libxl_domain.c:889:domain_destroy_cb: Domain 274:Destruction of domain failed
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:510:libxl_read_exactly: file/stream truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration transport process [19847] exited with error status 1
Migration failed, resuming at sender.




HVM guest config file:
--------------------------------
builder = "hvm"
name = "24hrs_lm_guest"
memory = 8192
vcpus = 4
vif = [ 'type=ioemu, mac=00:16:3e:12:4d:57, bridge=xenbr0' ]
disk = [ '/root/images/img.24hrs_lm_guest,qcow2,hda,rw' ]
device_model_override = '/usr/local/lib/xen/bin/qemu-system-i386'
device_model_version = 'qemu-xen'
sdl=0
vnc=1
stdvga=1
hap=1
acpi=1
gfx_passthru=0
hpet=1
serial='pty'
usb=1
usbdevice=['tablet']



Best Regards,
Xudong



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* Re: [BUG] repeated live migration for VM failed
From: Julien Grall @ 2017-05-22  9:56 UTC (permalink / raw)
  To: Hao, Xudong, xen-devel
  Cc: Lars Kurth, Andrew Cooper, George Dunlap, Jan Beulich, Gao, Chao

(CC Jan and Andrew)


-- 
Julien Grall


* Re: [BUG] repeated live migration for VM failed
From: George Dunlap @ 2017-05-22 10:18 UTC (permalink / raw)
  To: Hao, Xudong, xen-devel; +Cc: Lars Kurth, Julien Grall, Gao, Chao

On 22/05/17 07:35, Hao, Xudong wrote:
> xc: error: set HVM param 12 = 0x00000000feffe000 (85 = Interrupted system call should ): Internal error
> 
> xc: error: Restore failed (85 = Interrupted system call should ): Internal error

Interesting -- it appears that setting HVM_PARAM_IDENT_PT (#12) can fail
with -ERESTART.  But the comment for ERESTART makes it explicit that it
should be internal only -- it should cause a hypercall continuation (so
that the hypercall restarts automatically), rather than returning to the
guest.
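
To make the expected behaviour concrete, here is a toy model in plain C rather than Xen code. All names (`SKETCH_ERESTART`, `sketch_set_param`, the two dispatch functions) are hypothetical, and the value 85 is simply the error number shown in the log above. The retry loop only stands in for a continuation; as far as I understand it, a real continuation arranges for the hypercall to be re-issued rather than looping in place.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the internal restart code (85 in the log). */
#define SKETCH_ERESTART 85

/* Toy state: the handler finds its lock busy for the first few attempts. */
struct sketch_state {
    int busy_attempts_left; /* attempts that will still find the lock held */
    int calls;              /* how many times the handler was entered */
};

/* Hypothetical handler: asks for a restart while the lock is contended. */
long sketch_set_param(struct sketch_state *s)
{
    s->calls++;
    if (s->busy_attempts_left > 0) {
        s->busy_attempts_left--;
        return -SKETCH_ERESTART;
    }
    return 0;
}

/* Dispatch without continuation handling: the internal code leaks out. */
long sketch_dispatch_no_continuation(struct sketch_state *s)
{
    return sketch_set_param(s);
}

/*
 * Dispatch with continuation handling, modelled as a retry loop: the
 * caller never sees -SKETCH_ERESTART, only the final result.
 */
long sketch_dispatch_with_continuation(struct sketch_state *s)
{
    long rc;

    do {
        rc = sketch_set_param(s);
    } while (rc == -SKETCH_ERESTART);

    assert(rc != -SKETCH_ERESTART);
    return rc;
}
```

With a lock that is busy for two attempts, the first dispatcher returns -85 to its caller (which matches what the toolstack reported), while the second enters the handler three times and returns 0.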

But the hypercall continuation code seems to have disappeared from
do_hvm_op() at some point?

/me digs a bit more...

 -George



* Re: [BUG] repeated live migration for VM failed
From: George Dunlap @ 2017-05-22 11:03 UTC (permalink / raw)
  To: Hao, Xudong, xen-devel
  Cc: Lars Kurth, Andrew Cooper, Julien Grall, Paul Durrant,
	Jan Beulich, Gao, Chao



The problem turns out to be commit ae20ccf ("dm_op: convert
HVMOP_set_mem_type"), which says:

    This patch removes the need for handling HVMOP restarts, so that
    infrastructure is removed.

While it's true that no operations still need iteration information
restored, there are two operations which may still need to be
restarted to avoid deadlocks with other operations.
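
The handler side of this pattern can be sketched as follows. This is a simplified illustration, not the hvm.c code: the lock type, the `toy_` names and `TOY_ERESTART` are all hypothetical. It only shows the idea from the patch description that the operation backs off with a restart code when it cannot take the domctl lock, instead of waiting for it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the internal restart code. */
#define TOY_ERESTART 85

/* Toy lock: a flag is enough to show the control flow (no real locking). */
struct toy_lock {
    int held;
};

/* Returns 1 if the lock was taken, 0 if it was already held. */
int toy_trylock(struct toy_lock *l)
{
    if (l->held)
        return 0;
    l->held = 1;
    return 1;
}

void toy_unlock(struct toy_lock *l)
{
    assert(l->held);
    l->held = 0;
}

/*
 * Hypothetical parameter setter: if the lock is contended it changes
 * nothing and requests a restart, so it never waits while another
 * operation holds the lock.
 */
long toy_set_ident_pt(struct toy_lock *domctl, unsigned long *param,
                      unsigned long value)
{
    if (!toy_trylock(domctl))
        return -TOY_ERESTART;

    *param = value;
    toy_unlock(domctl);
    return 0;
}
```

Because the failed attempt leaves no partial state behind, restarting it needs no stored iteration information; the caller just has to run the same operation again.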

Attached is a patch which restores hypercall continuation checking.
Xudong, can you give it a test?

Thanks,
 -George

[-- Attachment #2: 0001-Restore-HVM_OP-hypercall-continuation-partial-revert.patch --]
[-- Type: text/x-diff, Size: 2508 bytes --]

From 3d4ce135ea3b396bb63752c39e6234366d590c16 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Mon, 22 May 2017 11:38:31 +0100
Subject: [PATCH] Restore HVM_OP hypercall continuation (partial revert of
 ae20ccf)

Commit ae20ccf removed the hypercall continuation logic from the end
of do_hvm_op(), claiming:

"This patch removes the need for handling HVMOP restarts, so that
infrastructure is removed."

That turns out to be false.  The removal of HVMOP_set_mem_type removed
the need to store a start iteration value in the hypercall
continuation, but a grep through hvm.c for ERESTART turns up at least
two places where do_hvm_op() may still need a hypercall continuation:

 * HVMOP_set_hvm_param can return -ERESTART when setting
   HVM_PARAM_IDENT_PT in the event that it fails to acquire the domctl
   lock

 * HVMOP_flush_tlbs can return -ERESTART if several vcpus call it at
   the same time

In both cases, a simple restart (with no stored iteration information)
is necessary.

Add a check for -ERESTART again, along with a comment at the top of
the function regarding the lack of decoding any information from the
op value.

Reported-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Paul Durrant <paul.durrant@citrix.com>
---
 xen/arch/x86/hvm/hvm.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 81691e2..e3e817d 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4544,6 +4544,13 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     long rc = 0;
 
+    /* 
+     * NB: hvm_op can be part of a restarted hypercall; but at the
+     * moment the only hypercalls which do continuations don't need to
+     * store any iteration information (since they're just re-trying
+     * the acquisition of a lock).
+     */
+    
     switch ( op )
     {
     case HVMOP_set_evtchn_upcall_vector:
@@ -4636,6 +4643,10 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
     }
     }
 
+    if ( rc == -ERESTART )
+        rc = hypercall_create_continuation(__HYPERVISOR_hvm_op, "lh",
+                                           op, arg);
+
     return rc;
 }
 
@@ -4869,4 +4880,3 @@ void hvm_set_segment_register(struct vcpu *v, enum x86_segment seg,
  * indent-tabs-mode: nil
  * End:
  */
-
-- 
2.1.4



* Re: [BUG] repeated live migration for VM failed
From: Andrew Cooper @ 2017-05-22 11:11 UTC (permalink / raw)
  To: George Dunlap, Hao, Xudong, xen-devel
  Cc: Lars Kurth, Julien Grall, Paul Durrant, Jan Beulich, Gao, Chao



On 22/05/17 12:03, George Dunlap wrote:
> From 3d4ce135ea3b396bb63752c39e6234366d590c16 Mon Sep 17 00:00:00 2001
> From: George Dunlap <george.dunlap@citrix.com>
> Date: Mon, 22 May 2017 11:38:31 +0100
> Subject: [PATCH] Restore HVM_OP hypercall continuation (partial revert of
>  ae20ccf)

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> (with the final
hunk removed)




* Re: [BUG] repeated live migration for VM failed
From: Julien Grall @ 2017-05-22 17:38 UTC (permalink / raw)
  To: Andrew Cooper, George Dunlap, Hao, Xudong, xen-devel
  Cc: Lars Kurth, Paul Durrant, Jan Beulich, Gao, Chao

Hi,

On 22/05/17 12:11, Andrew Cooper wrote:
>> From 3d4ce135ea3b396bb63752c39e6234366d590c16 Mon Sep 17 00:00:00 2001
>> From: George Dunlap <george.dunlap@citrix.com>
>> Date: Mon, 22 May 2017 11:38:31 +0100
>> Subject: [PATCH] Restore HVM_OP hypercall continuation (partial revert of
>>  ae20ccf)
>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> (with the final
> hunk removed)

Release-acked-by: Julien Grall <julien.grall@arm.com>

Cheers,

-- 
Julien Grall


* Re: [BUG] repeated live migration for VM failed
From: Hao, Xudong @ 2017-05-23  9:22 UTC (permalink / raw)
  To: George Dunlap, xen-devel
  Cc: Lars Kurth, Andrew Cooper, Julien Grall, Paul Durrant,
	Jan Beulich, Gao, Chao

George, thanks for the fix.
With the patch, the test has completed 90+ live migrations without any error so far; let's wait for the final result.

Thanks,
-Xudong



* Re: [BUG] repeated live migration for VM failed
From: Hao, Xudong @ 2017-05-24 23:06 UTC (permalink / raw)
  To: George Dunlap, xen-devel
  Cc: Lars Kurth, Andrew Cooper, Julien Grall, Paul Durrant,
	Jan Beulich, Gao, Chao

George,
Live migration has passed 500+ times with this patch, so I think it's fine to merge it into Xen 4.9.

Tested-by: Xudong Hao <xudong.hao@intel.com>

Thanks,
-Xudong

