From: Juan Quintela
Reply-To: quintela@redhat.com
Date: Fri, 27 Mar 2015 11:51:24 +0100
Subject: Re: [Qemu-devel] [Migration Bug?] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
In-Reply-To: <55152D5B.1090906@huawei.com> (zhanghailiang's message of "Fri, 27 Mar 2015 18:13:47 +0800")
To: zhanghailiang
Cc: hangaohuai@huawei.com, Li Zhijian, qemu-devel@nongnu.org, peter.huangpeng@huawei.com, "Gonglei (Arei)", Amit Shah, "Dr. David Alan Gilbert (git)", david@gibson.dropbear.id.au

zhanghailiang wrote:
> On 2015/3/26 11:52, Li Zhijian wrote:
>> On 03/26/2015 11:12 AM, Wen Congyang wrote:
>>> On 03/25/2015 05:50 PM, Juan Quintela wrote:
>>>> zhanghailiang wrote:
>>>>> Hi all,
>>>>>
>>>>> We found that, sometimes, the content of the VM's memory is
>>>>> inconsistent between the source side and the destination side
>>>>> when we check it just after migration finishes but before the VM
>>>>> continues to run.
>>>>>
>>>>> We use a patch like the one below to find this issue (you can find
>>>>> it in the attachment), and the steps to reproduce are:
>>>>>
>>>>> (1) Compile QEMU:
>>>>>  ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" && make
>>>>>
>>>>> (2) Command and output:
>>>>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>>> qemu64,-kvmclock -netdev tap,id=hn0 -device
>>>>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
>>>>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>>>>> -device
>>>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
>>>>> -monitor stdio
>>>> Could you try to reproduce:
>>>> - without vhost
>>>> - without virtio-net
>>>> - cache=unsafe is going to give you trouble, but trouble should only
>>>>   happen after migration of pages has finished.
>>> If I use an IDE disk, it doesn't happen.
>>> Even if I use virtio-net with vhost=on, it still doesn't happen. I guess
>>> it is because I migrate the guest while it is booting. The virtio net
>>> device is not used in this case.
>> Er~~ it reproduces with my IDE disk too.
>> There is no virtio device at all; my command line is like below:
>>
>> x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -net none
>> -boot c -drive file=/home/lizj/ubuntu.raw -vnc :7 -m 2048 -smp 2 -machine
>> usb=off -no-user-config -nodefaults -monitor stdio -vga std
>>
>> It seems easy to reproduce this issue with the following steps in an _ubuntu_ guest:
>> 1. on the source side, choose memtest in grub
>> 2. start live migration
>> 3. exit memtest (press Esc while the memory test is running)
>> 4. wait for migration to complete
>>
>
> Yes, it is a thorny problem, and it is indeed easy to reproduce, just
> by following the steps above.

Thanks for the test case. I will give it a try on Monday. Now that we
have a test case, we should be able to instrument things; two sketches
of such instrumentation are appended at the end of this mail.

As the problem shows up in memtest, it clearly can't be the disk :p

Later, Juan.

> This is my test result: (I also tested accel=tcg; it can be reproduced there as well.)
> Source side:
> # x86_64-softmmu/qemu-system-x86_64 -machine
> pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu
> qemu64,-kvmclock -boot c -drive
> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device
> cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
> (qemu) ACPI_BUILD: init ACPI tables
> ACPI_BUILD: init ACPI tables
> migrate tcp:9.61.1.8:3004
> ACPI_BUILD: init ACPI tables
> before cpu_synchronize_all_states
> 5a8f72d66732cac80d6a0d5713654c0e
> md_host : before saving ram complete
> 5a8f72d66732cac80d6a0d5713654c0e
> md_host : after saving ram complete
> 5a8f72d66732cac80d6a0d5713654c0e
> (qemu)
>
> Destination side:
> # x86_64-softmmu/qemu-system-x86_64 -machine
> pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu
> qemu64,-kvmclock -boot c -drive
> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device
> cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
> -incoming tcp:0:3004
> (qemu) QEMU_VM_SECTION_END, after loading ram
> d7cb0d8a4bdd1557fb0e78baee50c986
> md_host : after loading all vmstate
> d7cb0d8a4bdd1557fb0e78baee50c986
> md_host : after cpu_synchronize_all_post_init
> d7cb0d8a4bdd1557fb0e78baee50c986
>
>
> Thanks,
> zhang
>
>>>
>>>> What kind of load were you having when reproducing this issue?
>>>> Just to confirm, you have been able to reproduce this without the COLO
>>>> patches, right?
>>>>
>>>>> (qemu) migrate tcp:192.168.3.8:3004
>>>>> before saving ram complete
>>>>> ff703f6889ab8701e4e040872d079a28
>>>>> md_host : after saving ram complete
>>>>> ff703f6889ab8701e4e040872d079a28
>>>>>
>>>>> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>>> qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device
>>>>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
>>>>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>>>>> -device
>>>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
>>>>> -monitor stdio -incoming tcp:0:3004
>>>>> (qemu) QEMU_VM_SECTION_END, after loading ram
>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>> md_host : after loading all vmstate
>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>> md_host : after cpu_synchronize_all_post_init
>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>>
>>>>> This happens occasionally, and it is easier to reproduce when the
>>>>> migration command is issued during the VM's startup time.
>>>> OK, a couple of things. Memory doesn't have to be exactly identical.
>>>> Virtio devices in particular do funny things on "post-load". There
>>>> are no guarantees for that as far as I know; we should end up with an
>>>> equivalent device state in memory.
>>>>
>>>>> We have done further testing and found that some pages have been
>>>>> dirtied but their corresponding bits in migration_bitmap are not set.
>>>>> We can't figure out which module of QEMU misses setting the bitmap
>>>>> when dirtying the VM's pages;
>>>>> it is very difficult for us to trace all the actions that dirty them.
>>>> This seems to point to a bug in one of the devices.
>>>>
>>>>> Actually, the first time we found this problem was during COLO FT
>>>>> development, and it triggered some strange issues in the
>>>>> VM which all pointed to the inconsistency of the VM's
>>>>> memory. (We have tried saving all of the VM's memory to the slave
>>>>> side at every COLO FT checkpoint, and then everything is OK.)
>>>>>
>>>>> Is it OK for some pages not to be transferred to the destination
>>>>> during migration? Or is it a bug?
>>>> Pages transferred should be the same; it is after device state
>>>> transmission that things could change.
>>>>
>>>>> This issue has blocked our COLO development... :(
>>>>>
>>>>> Any help will be greatly appreciated!
>>>> Later, Juan.
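
The "md_host" lines in the logs above come from the reporters' debugging
patch, which is attached to the original report and not reproduced here.
A minimal sketch of that kind of instrumentation, assuming QEMU 2.x-era
internals (RAMBlock, ram_list, QLIST_FOREACH_RCU) and OpenSSL's MD5
(which matches the --extra-ldflags="-lssl" configure flag above), might
look like this:

#include <stdio.h>
#include <openssl/md5.h>
/* QEMU-internal headers, 2.x era (assumed): */
#include "exec/ram_addr.h"
#include "qemu/rcu_queue.h"

/* Hash all guest RAM and print the digest, so that the same call placed
 * at "before saving ram complete" on the source and "after loading all
 * vmstate" on the destination can be compared by eye. */
static void md_host_dump(const char *stage)
{
    MD5_CTX ctx;
    unsigned char digest[MD5_DIGEST_LENGTH];
    RAMBlock *block;
    int i;

    MD5_Init(&ctx);
    rcu_read_lock();
    /* Feed every RAM block's contents to the hash. */
    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
        MD5_Update(&ctx, block->host, block->used_length);
    }
    rcu_read_unlock();
    MD5_Final(digest, &ctx);

    printf("md_host : %s\n", stage);
    for (i = 0; i < MD5_DIGEST_LENGTH; i++) {
        printf("%02x", digest[i]);
    }
    printf("\n");
}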
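
The thread's working theory is that some component dirties guest memory
without setting the corresponding bit in migration_bitmap. One
hypothetical way to catch the offender (a sketch only: page_sums and its
upkeep as each page is sent are assumed, migration_bitmap refers to the
2.3-era bitmap in arch_init.c, and crc32 comes from zlib, which QEMU
already links against) would be a verification pass run right before
"saving ram complete":

#include <stdio.h>
#include <inttypes.h>
#include <zlib.h>          /* crc32() */
#include "exec/ram_addr.h" /* RAMBlock, TARGET_PAGE_SIZE (assumed 2.x layout) */
#include "qemu/bitops.h"   /* test_bit() */

/* One checksum per guest page, updated whenever a page is sent to the
 * destination (maintenance code not shown). */
static uint32_t *page_sums;

/* Re-checksum every page of a RAM block.  A page whose contents changed
 * since it was last sent, but whose bit in migration_bitmap is still
 * clear, is exactly the suspected bug: memory dirtied without being
 * marked dirty. */
static void check_missed_dirty(RAMBlock *block, unsigned long *migration_bitmap)
{
    ram_addr_t offset;

    for (offset = 0; offset < block->used_length; offset += TARGET_PAGE_SIZE) {
        ram_addr_t addr = block->offset + offset;
        unsigned long page = addr >> TARGET_PAGE_BITS;
        uint32_t sum = crc32(0, block->host + offset, TARGET_PAGE_SIZE);

        if (sum != page_sums[page] && !test_bit(page, migration_bitmap)) {
            fprintf(stderr, "page 0x%" PRIx64 " changed but is not marked dirty\n",
                    (uint64_t)addr);
        }
    }
}

Run after the final bitmap sync but before the last RAM flush, such a
pass would name the pages that change behind the bitmap's back; a
hardware watchpoint on one of those pages could then name the code.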