From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, HTML_MESSAGE,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B9D38C433E0 for ; Fri, 15 May 2020 03:30:06 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 7C486206F4 for ; Fri, 15 May 2020 03:30:06 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 7C486206F4 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=huawei.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:47482 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jZR2z-0008Db-NY for qemu-devel@archiver.kernel.org; Thu, 14 May 2020 23:30:05 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:49588) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1jZR2D-0007OM-Dk for qemu-devel@nongnu.org; Thu, 14 May 2020 23:29:17 -0400 Received: from szxga01-in.huawei.com ([45.249.212.187]:2465 helo=huawei.com) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1jZR2B-0003Ai-2u for qemu-devel@nongnu.org; Thu, 14 May 2020 23:29:16 -0400 Received: from DGGEMM404-HUB.china.huawei.com (unknown [172.30.72.55]) by Forcepoint Email with ESMTP id B358D67089E0417AE9A4; Fri, 15 May 2020 11:29:08 +0800 (CST) Received: from dggeme755-chm.china.huawei.com (10.3.19.101) by DGGEMM404-HUB.china.huawei.com (10.3.20.212) with Microsoft SMTP Server (TLS) id 14.3.487.0; Fri, 15 May 2020 11:28:43 +0800 Received: from dggeme756-chm.china.huawei.com (10.3.19.102) by dggeme755-chm.china.huawei.com (10.3.19.101) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256) id 15.1.1913.5; Fri, 15 May 2020 11:28:43 +0800 Received: from dggeme756-chm.china.huawei.com ([10.6.80.68]) by dggeme756-chm.china.huawei.com ([10.6.80.68]) with mapi id 15.01.1913.007; Fri, 15 May 2020 11:28:43 +0800 From: Zhanghailiang To: "Zhang, Chen" , "Dr . David Alan Gilbert" , qemu-devel , Li Zhijian Subject: RE: About migration/colo issue Thread-Topic: About migration/colo issue Thread-Index: AdYqZjudkmlVb8B9TvKOADpJ4VTuWAAAYZqA Date: Fri, 15 May 2020 03:28:43 +0000 Message-ID: <02d979c3d0004d07abed10bb6ddeba26@huawei.com> References: <7a26ed7efed94d2dbff591521d31076a@intel.com> In-Reply-To: <7a26ed7efed94d2dbff591521d31076a@intel.com> Accept-Language: zh-CN, en-US Content-Language: zh-CN X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.173.220.30] Content-Type: multipart/alternative; boundary="_000_02d979c3d0004d07abed10bb6ddeba26huaweicom_" MIME-Version: 1.0 X-CFilter-Loop: Reflected Received-SPF: pass client-ip=45.249.212.187; envelope-from=zhang.zhanghailiang@huawei.com; helo=huawei.com X-detected-operating-system: by eggs.gnu.org: First seen = 2020/05/14 21:45:54 X-ACL-Warn: Detected OS = Linux 3.11 and newer [fuzzy] X-Spam_score_int: -41 X-Spam_score: -4.2 X-Spam_bar: ---- X-Spam_report: (-4.2 / 5.0 requ) BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001 autolearn=_AUTOLEARN X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Jason Wang , Lukas Straub Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" --_000_02d979c3d0004d07abed10bb6ddeba26huaweicom_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Hi Zhang Chen, >From your tracing log, it seems to be hanged in colo_flush_ram_cache()? Does it come across a dead loop there ? I'll test it by using the new qemu. Thanks, Hailiang From: Zhang, Chen [mailto:chen.zhang@intel.com] Sent: Friday, May 15, 2020 11:16 AM To: Zhanghailiang ; Dr . David Alan Gilbert= ; qemu-devel ; Li Zhijian Cc: Jason Wang ; Lukas Straub Subject: About migration/colo issue Hi Hailiang/Dave. I found a urgent problem in current upstream code, COLO will stuck on secon= dary checkpoint and later. The guest will stuck by this issue. I have bisect upstream code, this issue caused by Hailiang's optimize patch= : >From 0393031a16735835a441b6d6e0495a1bd14adb90 Mon Sep 17 00:00:00 2001 From: zhanghailiang > Date: Mon, 24 Feb 2020 14:54:10 +0800 Subject: [PATCH] COLO: Optimize memory back-up process This patch will reduce the downtime of VM for the initial process, Previously, we copied all these memory in preparing stage of COLO while we need to stop VM, which is a time-consuming process. Here we optimize it by a trick, back-up every page while in migration process while COLO is enabled, though it affects the speed of the migration, but it obviously reduce the downtime of back-up all SVM'S memory in COLO preparing stage. Signed-off-by: zhanghailiang > Message-Id: <20200224065414.36524-5-zhang.zhanghailiang@huawei.com> Signed-off-by: Dr. David Alan Gilbert > minor typo fixes Hailiang, do you have time to look into it? The detail log: Primary node: 13322@1589511271.917346:colo_receive_message Receive 'checkpoint-ready' message {"timestamp": {"seconds": 1589511271, "microseconds": 917406}, "event": "RE= SUME"} 13322@1589511271.917842:colo_vm_state_change Change 'stop' =3D> 'run' 13322@1589511291.243346:colo_send_message Send 'checkpoint-request' message 13322@1589511291.243978:colo_receive_message Receive 'checkpoint-reply' message {"timestamp": {"seconds": 1589511291, "microseconds": 244096}, "event": "ST= OP"} 13322@1589511291.244444:colo_vm_state_change Change 'run' =3D> 'stop' 13322@1589511291.244561:colo_send_message Send 'vmstate-send' message 13322@1589511291.258594:colo_send_message Send 'vmstate-size' message 13322@1589511305.412479:colo_receive_message Receive 'vmstate-received' message 13322@1589511309.031826:colo_receive_message Receive 'vmstate-loaded' message {"timestamp": {"seconds": 1589511309, "microseconds": 31862}, "event": "RES= UME"} 13322@1589511309.033075:colo_vm_state_change Change 'stop' =3D> 'run' {"timestamp": {"seconds": 1589511311, "microseconds": 111617}, "event": "VN= C_CONNECTED", "data": {"server": {"auth": "none", "family": "ipv4", "servic= e": "5907", "host": "0.0.0.0", "websocket": false}, "client": {"family": "i= pv4", "service": "51564", "host": "10.239.13.19", "websocket": false}}} {"timestamp": {"seconds": 1589511311, "microseconds": 116197}, "event": "VN= C_INITIALIZED", "data": {"server": {"auth": "none", "family": "ipv4", "serv= ice": "5907", "host": "0.0.0.0", "websocket": false}, "client": {"family": = "ipv4", "service": "51564", "host": "10.239.13.19", "websocket": false}}} 13322@1589511311.243271:colo_send_message Send 'checkpoint-request' message 13322@1589511311.351361:colo_receive_message Receive 'checkpoint-reply' message {"timestamp": {"seconds": 1589511311, "microseconds": 351439}, "event": "ST= OP"} 13322@1589511311.415779:colo_vm_state_change Change 'run' =3D> 'stop' 13322@1589511311.416001:colo_send_message Send 'vmstate-send' message 13322@1589511311.418620:colo_send_message Send 'vmstate-size' message Secondary node: {"timestamp": {"seconds": 1589510920, "microseconds": 778207}, "event": "RE= SUME"} 23619@1589510920.778835:colo_vm_state_change Change 'stop' =3D> 'run' 23619@1589510920.778891:colo_send_message Send 'checkpoint-ready' message 23619@1589510940.105539:colo_receive_message Receive 'checkpoint-request' message {"timestamp": {"seconds": 1589510940, "microseconds": 105712}, "event": "ST= OP"} 23619@1589510940.105917:colo_vm_state_change Change 'run' =3D> 'stop' 23619@1589510940.105971:colo_send_message Send 'checkpoint-reply' message 23619@1589510940.106767:colo_receive_message Receive 'vmstate-send' message 23619@1589510940.122808:colo_flush_ram_cache_begin dirty_pages 2456 23619@1589510953.618672:colo_flush_ram_cache_end 23619@1589510953.945083:colo_receive_message Receive 'vmstate-size' message 23619@1589510954.274816:colo_send_message Send 'vmstate-received' message qemu-system-x86_64: warning: TSC frequency mismatch between VM (2792980 kHz= ) and host (2925999 kHz), and TSC scaling unavailable {"timestamp": {"seconds": 1589510957, "microseconds": 754184}, "event": "RE= SUME"} 23619@1589510957.894113:colo_vm_state_change Change 'stop' =3D> 'run' 23619@1589510957.894162:colo_send_message Send 'vmstate-loaded' message 23619@1589510960.105977:colo_receive_message Receive 'checkpoint-request' message {"timestamp": {"seconds": 1589510960, "microseconds": 106148}, "event": "ST= OP"} 23619@1589510960.213773:colo_vm_state_change Change 'run' =3D> 'stop' 23619@1589510960.213797:colo_send_message Send 'checkpoint-reply' message 23619@1589510960.278771:colo_receive_message Receive 'vmstate-send' message 23619@1589510960.423268:colo_flush_ram_cache_begin dirty_pages 25 --_000_02d979c3d0004d07abed10bb6ddeba26huaweicom_ Content-Type: text/html; charset="us-ascii" Content-Transfer-Encoding: quoted-printable

Hi Zhang Chen,

 

From your tracing log, it seems to be hanged in colo_flush_ram_ca= che()?

Does it come across a dead loop there ?

I’ll test it by using the new qemu.

 

Thanks,

Hailiang

 

From: Zhang, Chen [mailto:chen.zhang@intel.com]
Sent: Friday, May 15, 2020 11:16 AM
To: Zhanghailiang <zhang.zhanghailiang@huawei.com>; Dr . David= Alan Gilbert <dgilbert@redhat.com>; qemu-devel <qemu-devel@nongnu= .org>; Li Zhijian <lizhijian@cn.fujitsu.com>
Cc: Jason Wang <jasowang@redhat.com>; Lukas Straub <lukasst= raub2@web.de>
Subject: About migration/colo issue

 

Hi Hailiang/Dave.

 

I found a urgent problem in cur= rent upstream code, COLO will stuck on secondary checkpoint and later.=

The guest will stuck by this is= sue.

I have bisect upstream code, th= is issue caused by Hailiang’s optimize patch:

 

From 0393031a16735835a441b6d6e0= 495a1bd14adb90 Mon Sep 17 00:00:00 2001
From: zhanghailiang <z= hang.zhanghailiang@huawei.com>
Date: Mon, 24 Feb 2020 14:54:10 +0800
Subject: [PATCH] COLO: Optimize memory back-up process

This patch will reduce the downtime of VM for the initial process,
Previously, we copied all these memory in preparing stage of COLO
while we need to stop VM, which is a time-consuming process.
Here we optimize it by a trick, back-up every page while in migration
process while COLO is enabled, though it affects the speed of the
migration, but it obviously reduce the downtime of back-up all SVM'S
memory in COLO preparing stage.

Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Message-Id: <20200224065414.36524-5-zhang.zhanghailiang@huawei.com>=
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  minor typo fixes

 

Hailiang, do you have time to l= ook into it?

 

 

The detail log:

Primary node:=

13322@1589511271.917346:colo_receive_messa= ge Receive 'checkpoint-ready' message
{"timestamp": {"seconds": 1589511271, "microsecond= s": 917406}, "event": "RESUME"}
13322@15895= 11271.917842:colo_vm_state_change Change 'stop' =3D> 'run'
13322@15895112= 91.243346:colo_send_message Send 'checkpoint-request' message
13322@15895= 11291.243978:colo_receive_message Receive 'checkpoint-reply' message {"timestamp": {"seconds": 1589511291, "microsecond= s": 244096}, "event": "STOP"}
13322@15895= 11291.244444:colo_vm_state_change Change 'run' =3D> 'stop'
13322@15895112= 91.244561:colo_send_message Send 'vmstate-send' message
13322@15895112= 91.258594:colo_send_message Send 'vmstate-size' message
13322@15895= 11305.412479:colo_receive_message Receive 'vmstate-received' message 13322@15895= 11309.031826:colo_receive_message Receive 'vmstate-loaded' message
{"timestamp": {"seconds": 1589511309, "microsecond= s": 31862}, "event": "RESUME"}
13322@15895= 11309.033075:colo_vm_state_change Change 'stop' =3D> 'run'
{"timestamp": {"seconds": 1589511311, "microsecond= s": 111617}, "event": "VNC_CONNECTED", "data&= quot;: {"server": {"auth": "none", "fami= ly": "ipv4", "service": "5907", "ho= st": "0.0.0.0", "websocket": false}, "client&= quot;: {"family": "ipv4", "service": "51= 564", "host": "10.239.13.19", "websocket": false}}}
{"timestamp": {"seconds": 1589511311, "microsecond= s": 116197}, "event": "VNC_INITIALIZED", "dat= a": {"server": {"auth": "none", "fa= mily": "ipv4", "service": "5907", "= host": "0.0.0.0", "websocket": false}, "clien= t": {"family": "ipv4", "service": "= 51564", "host": "10.239.13.19", "websocket": false}}}
13322@15895113= 11.243271:colo_send_message Send 'checkpoint-request' message
13322@15895= 11311.351361:colo_receive_message Receive 'checkpoint-reply' message {"timestamp": {"seconds": 1589511311, "microsecond= s": 351439}, "event": "STOP"}
13322@15895= 11311.415779:colo_vm_state_change Change 'run' =3D> 'stop'
13322@15895113= 11.416001:colo_send_message Send 'vmstate-send' message
13322@15895113= 11.418620:colo_send_message Send 'vmstate-size' message

 

Secondary node:

{"timestamp": {"= seconds": 1589510920, "microseconds": 778207}, "event&q= uot;: "RESUME"}
23619@15895= 10920.778835:colo_vm_state_change Change 'stop' =3D> 'run'
23619@15895109= 20.778891:colo_send_message Send 'checkpoint-ready' message
23619@15895= 10940.105539:colo_receive_message Receive 'checkpoint-request' message<= br> {"timestamp": {"seconds": 1589510940, "microsecond= s": 105712}, "event": "STOP"}
23619@15895= 10940.105917:colo_vm_state_change Change 'run' =3D> 'stop'
23619@15895109= 40.105971:colo_send_message Send 'checkpoint-reply' message
23619@15895= 10940.106767:colo_receive_message Receive 'vmstate-send' message
23619= @1589510940.122808:colo_flush_ram_cache_begin dirty_pages 2456
23619@1= 589510953.618672:colo_flush_ram_cache_end
23619@15895= 10953.945083:colo_receive_message Receive 'vmstate-size' message
23619@15895109= 54.274816:colo_send_message Send 'vmstate-received' message
qemu-system-x86_64: warning: TSC frequency mismatch between VM (2792980 kHz= ) and host (2925999 kHz), and TSC scaling unavailable
{"timestamp": {"seconds": 1589510957, "microsecond= s": 754184}, "event": "RESUME"}
23619@15895= 10957.894113:colo_vm_state_change Change 'stop' =3D> 'run'
23619@15895109= 57.894162:colo_send_message Send 'vmstate-loaded' message
23619@15895= 10960.105977:colo_receive_message Receive 'checkpoint-request' message<= br> {"timestamp": {"seconds": 1589510960, "microsecond= s": 106148}, "event": "STOP"}
23619@15895= 10960.213773:colo_vm_state_change Change 'run' =3D> 'stop'
23619@15895109= 60.213797:colo_send_message Send 'checkpoint-reply' message
23619@15895= 10960.278771:colo_receive_message Receive 'vmstate-send' message
23619= @1589510960.423268:colo_flush_ram_cache_begin dirty_pages 25=

 

 

 

 

 

 

 

--_000_02d979c3d0004d07abed10bb6ddeba26huaweicom_--