From: Juan Quintela
Reply-To: quintela@redhat.com
Date: Fri, 27 Mar 2015 11:51:24 +0100
Subject: Re: [Qemu-devel] [Migration Bug?] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
In-Reply-To: <55152D5B.1090906@huawei.com> (zhanghailiang's message of "Fri, 27 Mar 2015 18:13:47 +0800")
To: zhanghailiang
Cc: hangaohuai@huawei.com, Li Zhijian, qemu-devel@nongnu.org, peter.huangpeng@huawei.com, "Gonglei (Arei)", Amit Shah, "Dr. David Alan Gilbert (git)", david@gibson.dropbear.id.au

zhanghailiang wrote:
> On 2015/3/26 11:52, Li Zhijian wrote:
>> On 03/26/2015 11:12 AM, Wen Congyang wrote:
>>> On 03/25/2015 05:50 PM, Juan Quintela wrote:
>>>> zhanghailiang wrote:
>>>>> Hi all,
>>>>>
>>>>> We found that, sometimes, the content of the VM's memory is
>>>>> inconsistent between the source side and the destination side
>>>>> when we check it just after migration finishes but before the VM
>>>>> continues to run.
>>>>>
>>>>> We use a patch like the one below to find this issue (you can find
>>>>> it in the attachment), and the steps to reproduce are:
>>>>>
>>>>> (1) Compile QEMU:
>>>>>  ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" && make
>>>>>
>>>>> (2) Command and output:
>>>>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>>> qemu64,-kvmclock -netdev tap,id=hn0 -device
>>>>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
>>>>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>>>>> -device
>>>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
>>>>> -monitor stdio
>>>> Could you try to reproduce:
>>>> - without vhost
>>>> - without virtio-net
>>>> - cache=unsafe is going to give you trouble, but trouble should only
>>>>   happen after migration of pages has finished.
>>> If I use an IDE disk, it doesn't happen.
>>> Even if I use virtio-net with vhost=on, it still doesn't happen. I guess
>>> it is because I migrate the guest while it is booting. The virtio net
>>> device is not used in this case.
>> Er~~ it reproduces with my IDE disk too.
>> There is no virtio device at all; my command line is like below:
>>
>> x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -net none
>> -boot c -drive file=/home/lizj/ubuntu.raw -vnc :7 -m 2048 -smp 2 -machine
>> usb=off -no-user-config -nodefaults -monitor stdio -vga std
>>
>> It seems easy to reproduce this issue with the following steps in an _ubuntu_ guest:
>> 1. on the source side, choose memtest in grub
>> 2. start live migration
>> 3. exit memtest (press Esc while the memory test is running)
>> 4. wait for migration to complete
>>
>
> Yes, it is a thorny problem, and it is indeed easy to reproduce, just
> by following the steps above.

Thanks for the test case. I will give it a try on Monday. Now that we
have a test case, we should be able to instrument things; two sketches
of such instrumentation are appended at the end of this mail.

As the problem shows up in memtest, it clearly can't be the disk :p

Later, Juan.

> This is my test result: (I also tested accel=tcg; it can be reproduced there as well.)
> Source side:
> # x86_64-softmmu/qemu-system-x86_64 -machine
> pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu
> qemu64,-kvmclock -boot c -drive
> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device
> cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
> (qemu) ACPI_BUILD: init ACPI tables
> ACPI_BUILD: init ACPI tables
> migrate tcp:9.61.1.8:3004
> ACPI_BUILD: init ACPI tables
> before cpu_synchronize_all_states
> 5a8f72d66732cac80d6a0d5713654c0e
> md_host : before saving ram complete
> 5a8f72d66732cac80d6a0d5713654c0e
> md_host : after saving ram complete
> 5a8f72d66732cac80d6a0d5713654c0e
> (qemu)
>
> Destination side:
> # x86_64-softmmu/qemu-system-x86_64 -machine
> pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu
> qemu64,-kvmclock -boot c -drive
> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device
> cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
> -incoming tcp:0:3004
> (qemu) QEMU_VM_SECTION_END, after loading ram
> d7cb0d8a4bdd1557fb0e78baee50c986
> md_host : after loading all vmstate
> d7cb0d8a4bdd1557fb0e78baee50c986
> md_host : after cpu_synchronize_all_post_init
> d7cb0d8a4bdd1557fb0e78baee50c986
>
>
> Thanks,
> zhang
>
>>>
>>>> What kind of load were you having when reproducing this issue?
>>>> Just to confirm, you have been able to reproduce this without the COLO
>>>> patches, right?
>>>>
>>>>> (qemu) migrate tcp:192.168.3.8:3004
>>>>> before saving ram complete
>>>>> ff703f6889ab8701e4e040872d079a28
>>>>> md_host : after saving ram complete
>>>>> ff703f6889ab8701e4e040872d079a28
>>>>>
>>>>> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>>> qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device
>>>>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
>>>>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>>>>> -device
>>>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
>>>>> -monitor stdio -incoming tcp:0:3004
>>>>> (qemu) QEMU_VM_SECTION_END, after loading ram
>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>> md_host : after loading all vmstate
>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>> md_host : after cpu_synchronize_all_post_init
>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>>
>>>>> This happens occasionally, and it is easier to reproduce when the
>>>>> migration command is issued during the VM's startup time.
>>>> OK, a couple of things. Memory doesn't have to be exactly identical.
>>>> Virtio devices in particular do funny things on "post-load". There
>>>> are no guarantees for that as far as I know; we should end up with an
>>>> equivalent device state in memory.
>>>>
>>>>> We have done further testing and found that some pages have been
>>>>> dirtied but their corresponding bits in migration_bitmap are not set.
>>>>> We can't figure out which module of QEMU misses setting the bitmap
>>>>> when dirtying the VM's pages;
>>>>> it is very difficult for us to trace all the actions that dirty them.
>>>> This seems to point to a bug in one of the devices.
>>>>
>>>>> Actually, the first time we found this problem was during COLO FT
>>>>> development, and it triggered some strange issues in the
>>>>> VM which all pointed to the inconsistency of the VM's
>>>>> memory. (We have tried saving all of the VM's memory to the slave
>>>>> side at every COLO FT checkpoint, and then everything is OK.)
>>>>>
>>>>> Is it OK for some pages not to be transferred to the destination
>>>>> during migration? Or is it a bug?
>>>> Pages transferred should be the same; it is after device state
>>>> transmission that things could change.
>>>>
>>>>> This issue has blocked our COLO development... :(
>>>>>
>>>>> Any help will be greatly appreciated!
>>>> Later, Juan.
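
The "md_host" lines in the logs above come from the reporters' debugging
patch, which is attached to the original report and not reproduced here.
A minimal sketch of that kind of instrumentation, assuming QEMU 2.x-era
internals (RAMBlock, ram_list, QLIST_FOREACH_RCU) and OpenSSL's MD5
(which matches the --extra-ldflags="-lssl" configure flag above), might
look like this:

#include <stdio.h>
#include <openssl/md5.h>
/* QEMU-internal headers, 2.x era (assumed): */
#include "exec/ram_addr.h"
#include "qemu/rcu_queue.h"

/* Hash all guest RAM and print the digest, so that the same call placed
 * at "before saving ram complete" on the source and "after loading all
 * vmstate" on the destination can be compared by eye. */
static void md_host_dump(const char *stage)
{
    MD5_CTX ctx;
    unsigned char digest[MD5_DIGEST_LENGTH];
    RAMBlock *block;
    int i;

    MD5_Init(&ctx);
    rcu_read_lock();
    /* Feed every RAM block's contents to the hash. */
    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
        MD5_Update(&ctx, block->host, block->used_length);
    }
    rcu_read_unlock();
    MD5_Final(digest, &ctx);

    printf("md_host : %s\n", stage);
    for (i = 0; i < MD5_DIGEST_LENGTH; i++) {
        printf("%02x", digest[i]);
    }
    printf("\n");
}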
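
The thread's working theory is that some component dirties guest memory
without setting the corresponding bit in migration_bitmap. One
hypothetical way to catch the offender (a sketch only: page_sums and its
upkeep as each page is sent are assumed, migration_bitmap refers to the
2.3-era bitmap in arch_init.c, and crc32 comes from zlib, which QEMU
already links against) would be a verification pass run right before
"saving ram complete":

#include <stdio.h>
#include <inttypes.h>
#include <zlib.h>          /* crc32() */
#include "exec/ram_addr.h" /* RAMBlock, TARGET_PAGE_SIZE (assumed 2.x layout) */
#include "qemu/bitops.h"   /* test_bit() */

/* One checksum per guest page, updated whenever a page is sent to the
 * destination (maintenance code not shown). */
static uint32_t *page_sums;

/* Re-checksum every page of a RAM block.  A page whose contents changed
 * since it was last sent, but whose bit in migration_bitmap is still
 * clear, is exactly the suspected bug: memory dirtied without being
 * marked dirty. */
static void check_missed_dirty(RAMBlock *block, unsigned long *migration_bitmap)
{
    ram_addr_t offset;

    for (offset = 0; offset < block->used_length; offset += TARGET_PAGE_SIZE) {
        ram_addr_t addr = block->offset + offset;
        unsigned long page = addr >> TARGET_PAGE_BITS;
        uint32_t sum = crc32(0, block->host + offset, TARGET_PAGE_SIZE);

        if (sum != page_sums[page] && !test_bit(page, migration_bitmap)) {
            fprintf(stderr, "page 0x%" PRIx64 " changed but is not marked dirty\n",
                    (uint64_t)addr);
        }
    }
}

Run after the final bitmap sync but before the last RAM flush, such a
pass would name the pages that change behind the bitmap's back; a
hardware watchpoint on one of those pages could then name the code.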