All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] broken incoming migration
@ 2013-05-30  7:44 Alexey Kardashevskiy
  2013-05-30  7:49 ` Alexey Kardashevskiy
                   ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-30  7:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-ppc, David Gibson

Hi!

I found the migration broken on pseries platform, specifically, this patch
broke it:

f1c72795af573b24a7da5eb52375c9aba8a37972
migration: do not sent zero pages in bulk stage

The idea is not to send zero pages to the destination guest which is
expected to have 100% empty RAM.

However on pseries plaftorm the guest always has some stuff in the RAM as a
part of initialization (device tree, system firmware and rtas (?)) so it is
not completely empty. As the source guest cannot detect this, it skips some
pages during migration and we get a broken destination guest. Bug.

While the idea is ok in general, I do not see any easy way to fix it as
neither QEMUMachine::init nor QEMUMachine::reset callbacks has information
about whether we are about to receive a migration or not (-incoming
parameter) and we cannot move device-tree and system firmware
initialization anywhere else.

ram_bulk_stage is static and cannot be disabled from the platform
initialization code.

So what would the community suggest? Thanks!


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30  7:44 [Qemu-devel] broken incoming migration Alexey Kardashevskiy
@ 2013-05-30  7:49 ` Alexey Kardashevskiy
  2013-05-30  7:49 ` Paolo Bonzini
  2013-05-30 10:18 ` Peter Maydell
  2 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-30  7:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-ppc, Paul Mackerras, David Gibson

Forgot some cc:.

On 05/30/2013 05:44 PM, Alexey Kardashevskiy wrote:
> Hi!
> 
> I found the migration broken on pseries platform, specifically, this patch
> broke it:
> 
> f1c72795af573b24a7da5eb52375c9aba8a37972
> migration: do not sent zero pages in bulk stage
> 
> The idea is not to send zero pages to the destination guest which is
> expected to have 100% empty RAM.
> 
> However on pseries plaftorm the guest always has some stuff in the RAM as a
> part of initialization (device tree, system firmware and rtas (?)) so it is
> not completely empty. As the source guest cannot detect this, it skips some
> pages during migration and we get a broken destination guest. Bug.
> 
> While the idea is ok in general, I do not see any easy way to fix it as
> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information
> about whether we are about to receive a migration or not (-incoming
> parameter) and we cannot move device-tree and system firmware
> initialization anywhere else.
> 
> ram_bulk_stage is static and cannot be disabled from the platform
> initialization code.
> 
> So what would the community suggest? Thanks!
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30  7:44 [Qemu-devel] broken incoming migration Alexey Kardashevskiy
  2013-05-30  7:49 ` Alexey Kardashevskiy
@ 2013-05-30  7:49 ` Paolo Bonzini
  2013-05-30  8:18   ` Alexey Kardashevskiy
  2013-05-30 10:18 ` Peter Maydell
  2 siblings, 1 reply; 49+ messages in thread
From: Paolo Bonzini @ 2013-05-30  7:49 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, David Gibson

Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto:
> Hi!
> 
> I found the migration broken on pseries platform, specifically, this patch
> broke it:
> 
> f1c72795af573b24a7da5eb52375c9aba8a37972
> migration: do not sent zero pages in bulk stage
> 
> The idea is not to send zero pages to the destination guest which is
> expected to have 100% empty RAM.
> 
> However on pseries plaftorm the guest always has some stuff in the RAM as a
> part of initialization (device tree, system firmware and rtas (?)) so it is
> not completely empty. As the source guest cannot detect this, it skips some
> pages during migration and we get a broken destination guest. Bug.
> 
> While the idea is ok in general, I do not see any easy way to fix it as
> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information
> about whether we are about to receive a migration or not (-incoming
> parameter) and we cannot move device-tree and system firmware
> initialization anywhere else.
> 
> ram_bulk_stage is static and cannot be disabled from the platform
> initialization code.
> 
> So what would the community suggest?

Revert the patch. :)

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30  7:49 ` Paolo Bonzini
@ 2013-05-30  8:18   ` Alexey Kardashevskiy
  2013-05-30  9:08     ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-30  8:18 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Peter Lieven, qemu-ppc, qemu-devel, David Gibson

On 05/30/2013 05:49 PM, Paolo Bonzini wrote:
> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto:
>> Hi!
>>
>> I found the migration broken on pseries platform, specifically, this patch
>> broke it:
>>
>> f1c72795af573b24a7da5eb52375c9aba8a37972
>> migration: do not sent zero pages in bulk stage
>>
>> The idea is not to send zero pages to the destination guest which is
>> expected to have 100% empty RAM.
>>
>> However on pseries plaftorm the guest always has some stuff in the RAM as a
>> part of initialization (device tree, system firmware and rtas (?)) so it is
>> not completely empty. As the source guest cannot detect this, it skips some
>> pages during migration and we get a broken destination guest. Bug.
>>
>> While the idea is ok in general, I do not see any easy way to fix it as
>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information
>> about whether we are about to receive a migration or not (-incoming
>> parameter) and we cannot move device-tree and system firmware
>> initialization anywhere else.
>>
>> ram_bulk_stage is static and cannot be disabled from the platform
>> initialization code.
>>
>> So what would the community suggest?
> 
> Revert the patch. :)

I'll wait for 24 hours (forgot to cc: the author) and then post a revert
patch :)



-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30  8:18   ` Alexey Kardashevskiy
@ 2013-05-30  9:08     ` Peter Lieven
  2013-05-30  9:31       ` Alexey Kardashevskiy
  2013-05-30 13:00       ` Paolo Bonzini
  0 siblings, 2 replies; 49+ messages in thread
From: Peter Lieven @ 2013-05-30  9:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson

Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy:
> On 05/30/2013 05:49 PM, Paolo Bonzini wrote:
>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto:
>>> Hi!
>>>
>>> I found the migration broken on pseries platform, specifically, this patch
>>> broke it:
>>>
>>> f1c72795af573b24a7da5eb52375c9aba8a37972
>>> migration: do not sent zero pages in bulk stage
>>>
>>> The idea is not to send zero pages to the destination guest which is
>>> expected to have 100% empty RAM.
>>>
>>> However on pseries plaftorm the guest always has some stuff in the RAM as a
>>> part of initialization (device tree, system firmware and rtas (?)) so it is
>>> not completely empty. As the source guest cannot detect this, it skips some
>>> pages during migration and we get a broken destination guest. Bug.
>>>
>>> While the idea is ok in general, I do not see any easy way to fix it as
>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information
>>> about whether we are about to receive a migration or not (-incoming
>>> parameter) and we cannot move device-tree and system firmware
>>> initialization anywhere else.
>>>
>>> ram_bulk_stage is static and cannot be disabled from the platform
>>> initialization code.
>>>
>>> So what would the community suggest?
>> Revert the patch. :)
> I'll wait for 24 hours (forgot to cc: the author) and then post a revert
> patch :)
>
>
>
does this problem only occur on pseries emulation?

not sending zero pages is not only a performance benefit it also makes
overcomitted memory usable. the madv_dontneed seems to kick in asynchronously
and memory is not available immediately.

what I do not understand if the a memory region is not empty at destination
due to device tree, firmware etc. it shouldn't be empty at the source as well so
in theory this should not be a problem.

Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30  9:08     ` Peter Lieven
@ 2013-05-30  9:31       ` Alexey Kardashevskiy
  2013-05-30 13:00       ` Paolo Bonzini
  1 sibling, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-30  9:31 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson

On 05/30/2013 07:08 PM, Peter Lieven wrote:
> Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy:
>> On 05/30/2013 05:49 PM, Paolo Bonzini wrote:
>>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto:
>>>> Hi!
>>>>
>>>> I found the migration broken on pseries platform, specifically, this patch
>>>> broke it:
>>>>
>>>> f1c72795af573b24a7da5eb52375c9aba8a37972
>>>> migration: do not sent zero pages in bulk stage
>>>>
>>>> The idea is not to send zero pages to the destination guest which is
>>>> expected to have 100% empty RAM.
>>>>
>>>> However on pseries plaftorm the guest always has some stuff in the RAM as a
>>>> part of initialization (device tree, system firmware and rtas (?)) so it is
>>>> not completely empty. As the source guest cannot detect this, it skips some
>>>> pages during migration and we get a broken destination guest. Bug.
>>>>
>>>> While the idea is ok in general, I do not see any easy way to fix it as
>>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information
>>>> about whether we are about to receive a migration or not (-incoming
>>>> parameter) and we cannot move device-tree and system firmware
>>>> initialization anywhere else.
>>>>
>>>> ram_bulk_stage is static and cannot be disabled from the platform
>>>> initialization code.
>>>>
>>>> So what would the community suggest?
>>> Revert the patch. :)
>> I'll wait for 24 hours (forgot to cc: the author) and then post a revert
>> patch :)
>>
>>
>>
> does this problem only occur on pseries emulation?

No idea, really.

> not sending zero pages is not only a performance benefit it also makes
> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously
> and memory is not available immediately.

Ok, I do not mind :)

> what I do not understand if the a memory region is not empty at destination
> due to device tree, firmware etc. it shouldn't be empty at the source as well so
> in theory this should not be a problem.

This is how it works - first QEMU allocates RAM and put devicetree+firmware
somewhere. Then QEMU starts the guest so the firmware starts, loads the
kernel and then the kernel zeroes the whole (most of?) RAM including the
area where the firmware used to be.

Now we migrate. If the source guest is in the kernel already, then it does
not know about the memory area previously occupied by the firmware, it is
just an empty page. If the source guest is still in the firmware, then
those pages are not empty and they are perfectly migrated.


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30  7:44 [Qemu-devel] broken incoming migration Alexey Kardashevskiy
  2013-05-30  7:49 ` Alexey Kardashevskiy
  2013-05-30  7:49 ` Paolo Bonzini
@ 2013-05-30 10:18 ` Peter Maydell
  2 siblings, 0 replies; 49+ messages in thread
From: Peter Maydell @ 2013-05-30 10:18 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, David Gibson

On 30 May 2013 08:44, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> I found the migration broken on pseries platform, specifically, this patch
> broke it:
>
> f1c72795af573b24a7da5eb52375c9aba8a37972
> migration: do not sent zero pages in bulk stage
>
> The idea is not to send zero pages to the destination guest which is
> expected to have 100% empty RAM.

This just seems like a broken assumption to me. If the migration
protocol is going to skip sending zero pages it should do it by saying
"there's a zero page here" and have the destination memset the
page to zero.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30  9:08     ` Peter Lieven
  2013-05-30  9:31       ` Alexey Kardashevskiy
@ 2013-05-30 13:00       ` Paolo Bonzini
  2013-05-30 13:38         ` Alexey Kardashevskiy
                           ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Paolo Bonzini @ 2013-05-30 13:00 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

Il 30/05/2013 11:08, Peter Lieven ha scritto:
> Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy:
>> On 05/30/2013 05:49 PM, Paolo Bonzini wrote:
>>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto:
>>>> Hi!
>>>>
>>>> I found the migration broken on pseries platform, specifically, this patch
>>>> broke it:
>>>>
>>>> f1c72795af573b24a7da5eb52375c9aba8a37972
>>>> migration: do not sent zero pages in bulk stage
>>>>
>>>> The idea is not to send zero pages to the destination guest which is
>>>> expected to have 100% empty RAM.
>>>>
>>>> However on pseries plaftorm the guest always has some stuff in the RAM as a
>>>> part of initialization (device tree, system firmware and rtas (?)) so it is
>>>> not completely empty. As the source guest cannot detect this, it skips some
>>>> pages during migration and we get a broken destination guest. Bug.
>>>>
>>>> While the idea is ok in general, I do not see any easy way to fix it as
>>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information
>>>> about whether we are about to receive a migration or not (-incoming
>>>> parameter) and we cannot move device-tree and system firmware
>>>> initialization anywhere else.
>>>>
>>>> ram_bulk_stage is static and cannot be disabled from the platform
>>>> initialization code.
>>>>
>>>> So what would the community suggest?
>>> Revert the patch. :)
>> I'll wait for 24 hours (forgot to cc: the author) and then post a revert
>> patch :)
>>
> does this problem only occur on pseries emulation?

Probably not.  On a PC, it would occur if you had 4K of zeros in the
source BIOS but not in the destination BIOS.  When you reboot, the BIOS
image is wrong.

> not sending zero pages is not only a performance benefit it also makes
> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously
> and memory is not available immediately.

You could also scan the page for nonzero values before writing it.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30 13:00       ` Paolo Bonzini
@ 2013-05-30 13:38         ` Alexey Kardashevskiy
  2013-05-30 14:08           ` Paolo Bonzini
  2013-05-30 14:38         ` Peter Lieven
  2013-06-08  8:24         ` Wenchao Xia
  2 siblings, 1 reply; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-05-30 13:38 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-ppc, Peter Lieven, qemu-devel, David Gibson

On 05/30/2013 11:00 PM, Paolo Bonzini wrote:
> Il 30/05/2013 11:08, Peter Lieven ha scritto:
>> Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy:
>>> On 05/30/2013 05:49 PM, Paolo Bonzini wrote:
>>>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto:
>>>>> Hi!
>>>>>
>>>>> I found the migration broken on pseries platform, specifically, this patch
>>>>> broke it:
>>>>>
>>>>> f1c72795af573b24a7da5eb52375c9aba8a37972
>>>>> migration: do not sent zero pages in bulk stage
>>>>>
>>>>> The idea is not to send zero pages to the destination guest which is
>>>>> expected to have 100% empty RAM.
>>>>>
>>>>> However on pseries plaftorm the guest always has some stuff in the RAM as a
>>>>> part of initialization (device tree, system firmware and rtas (?)) so it is
>>>>> not completely empty. As the source guest cannot detect this, it skips some
>>>>> pages during migration and we get a broken destination guest. Bug.
>>>>>
>>>>> While the idea is ok in general, I do not see any easy way to fix it as
>>>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information
>>>>> about whether we are about to receive a migration or not (-incoming
>>>>> parameter) and we cannot move device-tree and system firmware
>>>>> initialization anywhere else.
>>>>>
>>>>> ram_bulk_stage is static and cannot be disabled from the platform
>>>>> initialization code.
>>>>>
>>>>> So what would the community suggest?
>>>> Revert the patch. :)
>>> I'll wait for 24 hours (forgot to cc: the author) and then post a revert
>>> patch :)
>>>
>> does this problem only occur on pseries emulation?
> 
> Probably not.  On a PC, it would occur if you had 4K of zeros in the
> source BIOS but not in the destination BIOS.  When you reboot, the BIOS
> image is wrong.
> 
>> not sending zero pages is not only a performance benefit it also makes
>> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously
>> and memory is not available immediately.
> 
> You could also scan the page for nonzero values before writing it.

Scan where? On the source? It is there already. On the destination? Won't
just trying to read a page kill all effect from MADV_DONTNEED?



-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30 13:38         ` Alexey Kardashevskiy
@ 2013-05-30 14:08           ` Paolo Bonzini
  0 siblings, 0 replies; 49+ messages in thread
From: Paolo Bonzini @ 2013-05-30 14:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, Peter Lieven, qemu-devel, David Gibson

Il 30/05/2013 15:38, Alexey Kardashevskiy ha scritto:
>>> does this problem only occur on pseries emulation?
>>
>> Probably not.  On a PC, it would occur if you had 4K of zeros in the
>> source BIOS but not in the destination BIOS.  When you reboot, the BIOS
>> image is wrong.
>>
>>> not sending zero pages is not only a performance benefit it also makes
>>> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously
>>> and memory is not available immediately.
>>
>> You could also scan the page for nonzero values before writing it.
> 
> Scan where? On the source? It is there already. On the destination? Won't
> just trying to read a page kill all effect from MADV_DONTNEED?

No, zero pages are handled with special copy-on-write magic by the
kernel. :)  So, reading works but writing doesn't.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30 13:00       ` Paolo Bonzini
  2013-05-30 13:38         ` Alexey Kardashevskiy
@ 2013-05-30 14:38         ` Peter Lieven
  2013-05-30 14:41           ` Paolo Bonzini
  2013-06-03 10:04           ` [Qemu-devel] " Alexey Kardashevskiy
  2013-06-08  8:24         ` Wenchao Xia
  2 siblings, 2 replies; 49+ messages in thread
From: Peter Lieven @ 2013-05-30 14:38 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson




Am 30.05.2013 um 15:41 schrieb "Paolo Bonzini" <pbonzini@redhat.com>:

> Il 30/05/2013 11:08, Peter Lieven ha scritto:
>> Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy:
>>> On 05/30/2013 05:49 PM, Paolo Bonzini wrote:
>>>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto:
>>>>> Hi!
>>>>> 
>>>>> I found the migration broken on pseries platform, specifically, this patch
>>>>> broke it:
>>>>> 
>>>>> f1c72795af573b24a7da5eb52375c9aba8a37972
>>>>> migration: do not sent zero pages in bulk stage
>>>>> 
>>>>> The idea is not to send zero pages to the destination guest which is
>>>>> expected to have 100% empty RAM.
>>>>> 
>>>>> However on pseries plaftorm the guest always has some stuff in the RAM as a
>>>>> part of initialization (device tree, system firmware and rtas (?)) so it is
>>>>> not completely empty. As the source guest cannot detect this, it skips some
>>>>> pages during migration and we get a broken destination guest. Bug.
>>>>> 
>>>>> While the idea is ok in general, I do not see any easy way to fix it as
>>>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information
>>>>> about whether we are about to receive a migration or not (-incoming
>>>>> parameter) and we cannot move device-tree and system firmware
>>>>> initialization anywhere else.
>>>>> 
>>>>> ram_bulk_stage is static and cannot be disabled from the platform
>>>>> initialization code.
>>>>> 
>>>>> So what would the community suggest?
>>>> Revert the patch. :)
>>> I'll wait for 24 hours (forgot to cc: the author) and then post a revert
>>> patch :)
>> does this problem only occur on pseries emulation?
> 
> Probably not.  On a PC, it would occur if you had 4K of zeros in the
> source BIOS but not in the destination BIOS.  When you reboot, the BIOS
> image is wrong.
> 
>> not sending zero pages is not only a performance benefit it also makes
>> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously
>> and memory is not available immediately.
> 
> You could also scan the page for nonzero values before writing it.

i had this in mind, but then choosed the other approach.... turned out to be a bad idea.

alexey: i will prepare a patch later today, could you then please verify it fixes your problem.

paolo: would we still need the madvise or is it enough to not write the zeroes?

Peter

> 
> Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30 14:38         ` Peter Lieven
@ 2013-05-30 14:41           ` Paolo Bonzini
  2013-06-04 13:52             ` Peter Lieven
  2013-06-03 10:04           ` [Qemu-devel] " Alexey Kardashevskiy
  1 sibling, 1 reply; 49+ messages in thread
From: Paolo Bonzini @ 2013-05-30 14:41 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

Il 30/05/2013 16:38, Peter Lieven ha scritto:
>> > You could also scan the page for nonzero values before writing it.
> i had this in mind, but then choosed the other approach.... turned out to be a bad idea.
> 
> alexey: i will prepare a patch later today, could you then please verify it fixes your problem.
> 
> paolo: would we still need the madvise or is it enough to not write the zeroes?

It should be enough to not write them.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30 14:38         ` Peter Lieven
  2013-05-30 14:41           ` Paolo Bonzini
@ 2013-06-03 10:04           ` Alexey Kardashevskiy
  2013-06-04 10:56             ` Peter Lieven
  1 sibling, 1 reply; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-03 10:04 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson

On 05/31/2013 12:38 AM, Peter Lieven wrote:
> 
> 
> 
> Am 30.05.2013 um 15:41 schrieb "Paolo Bonzini" <pbonzini@redhat.com>:
> 
>> Il 30/05/2013 11:08, Peter Lieven ha scritto:
>>> Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy:
>>>> On 05/30/2013 05:49 PM, Paolo Bonzini wrote:
>>>>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto:
>>>>>> Hi!
>>>>>>
>>>>>> I found the migration broken on pseries platform, specifically, this patch
>>>>>> broke it:
>>>>>>
>>>>>> f1c72795af573b24a7da5eb52375c9aba8a37972
>>>>>> migration: do not sent zero pages in bulk stage
>>>>>>
>>>>>> The idea is not to send zero pages to the destination guest which is
>>>>>> expected to have 100% empty RAM.
>>>>>>
>>>>>> However on pseries plaftorm the guest always has some stuff in the RAM as a
>>>>>> part of initialization (device tree, system firmware and rtas (?)) so it is
>>>>>> not completely empty. As the source guest cannot detect this, it skips some
>>>>>> pages during migration and we get a broken destination guest. Bug.
>>>>>>
>>>>>> While the idea is ok in general, I do not see any easy way to fix it as
>>>>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information
>>>>>> about whether we are about to receive a migration or not (-incoming
>>>>>> parameter) and we cannot move device-tree and system firmware
>>>>>> initialization anywhere else.
>>>>>>
>>>>>> ram_bulk_stage is static and cannot be disabled from the platform
>>>>>> initialization code.
>>>>>>
>>>>>> So what would the community suggest?
>>>>> Revert the patch. :)
>>>> I'll wait for 24 hours (forgot to cc: the author) and then post a revert
>>>> patch :)
>>> does this problem only occur on pseries emulation?
>>
>> Probably not.  On a PC, it would occur if you had 4K of zeros in the
>> source BIOS but not in the destination BIOS.  When you reboot, the BIOS
>> image is wrong.
>>
>>> not sending zero pages is not only a performance benefit it also makes
>>> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously
>>> and memory is not available immediately.
>>
>> You could also scan the page for nonzero values before writing it.
> 
> i had this in mind, but then choosed the other approach.... turned out to be a bad idea.
> 
> alexey: i will prepare a patch later today, could you then please verify it fixes your problem.


Yes I can, where is the patch? :)



> paolo: would we still need the madvise or is it enough to not write the zeroes?
> 
> Peter
> 
>>
>> Paolo
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-03 10:04           ` [Qemu-devel] " Alexey Kardashevskiy
@ 2013-06-04 10:56             ` Peter Lieven
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Lieven @ 2013-06-04 10:56 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson

On 03.06.2013 12:04, Alexey Kardashevskiy wrote:
> On 05/31/2013 12:38 AM, Peter Lieven wrote:
>>
>>
>> Am 30.05.2013 um 15:41 schrieb "Paolo Bonzini" <pbonzini@redhat.com>:
>>
>>> Il 30/05/2013 11:08, Peter Lieven ha scritto:
>>>> Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy:
>>>>> On 05/30/2013 05:49 PM, Paolo Bonzini wrote:
>>>>>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto:
>>>>>>> Hi!
>>>>>>>
>>>>>>> I found the migration broken on pseries platform, specifically, this patch
>>>>>>> broke it:
>>>>>>>
>>>>>>> f1c72795af573b24a7da5eb52375c9aba8a37972
>>>>>>> migration: do not sent zero pages in bulk stage
>>>>>>>
>>>>>>> The idea is not to send zero pages to the destination guest which is
>>>>>>> expected to have 100% empty RAM.
>>>>>>>
>>>>>>> However on pseries plaftorm the guest always has some stuff in the RAM as a
>>>>>>> part of initialization (device tree, system firmware and rtas (?)) so it is
>>>>>>> not completely empty. As the source guest cannot detect this, it skips some
>>>>>>> pages during migration and we get a broken destination guest. Bug.
>>>>>>>
>>>>>>> While the idea is ok in general, I do not see any easy way to fix it as
>>>>>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information
>>>>>>> about whether we are about to receive a migration or not (-incoming
>>>>>>> parameter) and we cannot move device-tree and system firmware
>>>>>>> initialization anywhere else.
>>>>>>>
>>>>>>> ram_bulk_stage is static and cannot be disabled from the platform
>>>>>>> initialization code.
>>>>>>>
>>>>>>> So what would the community suggest?
>>>>>> Revert the patch. :)
>>>>> I'll wait for 24 hours (forgot to cc: the author) and then post a revert
>>>>> patch :)
>>>> does this problem only occur on pseries emulation?
>>> Probably not.  On a PC, it would occur if you had 4K of zeros in the
>>> source BIOS but not in the destination BIOS.  When you reboot, the BIOS
>>> image is wrong.
>>>
>>>> not sending zero pages is not only a performance benefit it also makes
>>>> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously
>>>> and memory is not available immediately.
>>> You could also scan the page for nonzero values before writing it.
>> i had this in mind, but then choosed the other approach.... turned out to be a bad idea.
>>
>> alexey: i will prepare a patch later today, could you then please verify it fixes your problem.
>
> Yes I can, where is the patch? :)

its on my todo for today. sorry, have been a bit busy lately.

Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30 14:41           ` Paolo Bonzini
@ 2013-06-04 13:52             ` Peter Lieven
  2013-06-04 14:14               ` Paolo Bonzini
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-06-04 13:52 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

On 30.05.2013 16:41, Paolo Bonzini wrote:
> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>> You could also scan the page for nonzero values before writing it.
>> i had this in mind, but then choosed the other approach.... turned out to be a bad idea.
>>
>> alexey: i will prepare a patch later today, could you then please verify it fixes your problem.
>>
>> paolo: would we still need the madvise or is it enough to not write the zeroes?
> It should be enough to not write them.

Problem: checking the pages for zero allocates them. even at the source.

Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-04 13:52             ` Peter Lieven
@ 2013-06-04 14:14               ` Paolo Bonzini
  2013-06-04 14:38                 ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Paolo Bonzini @ 2013-06-04 14:14 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

Il 04/06/2013 15:52, Peter Lieven ha scritto:
> On 30.05.2013 16:41, Paolo Bonzini wrote:
>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>> You could also scan the page for nonzero values before writing it.
>>> i had this in mind, but then choosed the other approach.... turned
>>> out to be a bad idea.
>>>
>>> alexey: i will prepare a patch later today, could you then please
>>> verify it fixes your problem.
>>>
>>> paolo: would we still need the madvise or is it enough to not write
>>> the zeroes?
>> It should be enough to not write them.
> 
> Problem: checking the pages for zero allocates them. even at the source.

It doesn't look like.  I tried this program and top doesn't show an
increasing amount of reserved memory:

#include <stdio.h>
#include <stdlib.h>
int main()
{
    char *x = malloc(500 << 20);
    int i, j;
    for (i = 0; i < 500; i += 10) {
        for (j = 0; j < 10 << 20; j += 4096) {
             *(volatile char*) (x + (i << 20) + j);
        }
        getchar();
    }
}

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-04 14:14               ` Paolo Bonzini
@ 2013-06-04 14:38                 ` Peter Lieven
  2013-06-04 14:40                   ` Paolo Bonzini
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-06-04 14:38 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

On 04.06.2013 16:14, Paolo Bonzini wrote:
> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>> You could also scan the page for nonzero values before writing it.
>>>> i had this in mind, but then choosed the other approach.... turned
>>>> out to be a bad idea.
>>>>
>>>> alexey: i will prepare a patch later today, could you then please
>>>> verify it fixes your problem.
>>>>
>>>> paolo: would we still need the madvise or is it enough to not write
>>>> the zeroes?
>>> It should be enough to not write them.
>> Problem: checking the pages for zero allocates them. even at the source.
> It doesn't look like.  I tried this program and top doesn't show an
> increasing amount of reserved memory:
>
> #include <stdio.h>
> #include <stdlib.h>
> int main()
> {
>      char *x = malloc(500 << 20);
>      int i, j;
>      for (i = 0; i < 500; i += 10) {
>          for (j = 0; j < 10 << 20; j += 4096) {
>               *(volatile char*) (x + (i << 20) + j);
>          }
>          getchar();
>      }
> }
strange. we are talking about RSS size, right?
is the malloc above using mmapped memory?
which kernel version do you use?

what avoids allocating the memory for me is the following (with
whatever side effects it has ;-))

diff --git a/arch_init.c b/arch_init.c
index 642f241..25d20a9 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -148,6 +148,10 @@ int qemu_read_default_config_files(bool userconfig)

  static inline bool is_zero_page(uint8_t *p)
  {
+    uint8_t ret;
+    if (mincore(p, TARGET_PAGE_SIZE, &ret) == 0 && !(ret&0x1)) {
+        return 1;
+    }
      return buffer_find_nonzero_offset(p, TARGET_PAGE_SIZE) ==
          TARGET_PAGE_SIZE;
  }


Peter




>
> Paolo


-- 

Mit freundlichen Grüßen

Peter Lieven

...........................................................

   KAMP Netzwerkdienste GmbH
   Vestische Str. 89-91 | 46117 Oberhausen
   Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
   pl@kamp.de | http://www.kamp.de

   Geschäftsführer: Heiner Lante | Michael Lante
   Amtsgericht Duisburg | HRB Nr. 12154
   USt-Id-Nr.: DE 120607556

...........................................................

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-04 14:38                 ` Peter Lieven
@ 2013-06-04 14:40                   ` Paolo Bonzini
  2013-06-04 14:48                     ` Peter Lieven
  2013-06-04 15:10                     ` Peter Lieven
  0 siblings, 2 replies; 49+ messages in thread
From: Paolo Bonzini @ 2013-06-04 14:40 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

Il 04/06/2013 16:38, Peter Lieven ha scritto:
> On 04.06.2013 16:14, Paolo Bonzini wrote:
>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>> out to be a bad idea.
>>>>>
>>>>> alexey: i will prepare a patch later today, could you then please
>>>>> verify it fixes your problem.
>>>>>
>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>> the zeroes?
>>>> It should be enough to not write them.
>>> Problem: checking the pages for zero allocates them. even at the source.
>> It doesn't look like.  I tried this program and top doesn't show an
>> increasing amount of reserved memory:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> int main()
>> {
>>      char *x = malloc(500 << 20);
>>      int i, j;
>>      for (i = 0; i < 500; i += 10) {
>>          for (j = 0; j < 10 << 20; j += 4096) {
>>               *(volatile char*) (x + (i << 20) + j);
>>          }
>>          getchar();
>>      }
>> }
> strange. we are talking about RSS size, right?

None of the three top values change, and only VIRT is >500 MB.

> is the malloc above using mmapped memory?

Yes.

> which kernel version do you use?

3.9.

> what avoids allocating the memory for me is the following (with
> whatever side effects it has ;-))

This would also fail to migrate any page that is swapped out, breaking
overcommit in a more subtle way. :)

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-04 14:40                   ` Paolo Bonzini
@ 2013-06-04 14:48                     ` Peter Lieven
  2013-06-04 15:17                       ` Paolo Bonzini
  2013-06-04 15:10                     ` Peter Lieven
  1 sibling, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-06-04 14:48 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

On 04.06.2013 16:40, Paolo Bonzini wrote:
> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>> out to be a bad idea.
>>>>>>
>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>> verify it fixes your problem.
>>>>>>
>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>> the zeroes?
>>>>> It should be enough to not write them.
>>>> Problem: checking the pages for zero allocates them. even at the source.
>>> It doesn't look like.  I tried this program and top doesn't show an
>>> increasing amount of reserved memory:
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> int main()
>>> {
>>>       char *x = malloc(500 << 20);
>>>       int i, j;
>>>       for (i = 0; i < 500; i += 10) {
>>>           for (j = 0; j < 10 << 20; j += 4096) {
>>>                *(volatile char*) (x + (i << 20) + j);
>>>           }
>>>           getchar();
>>>       }
>>> }
>> strange. we are talking about RSS size, right?
> None of the three top values change, and only VIRT is >500 MB.
>
>> is the malloc above using mmapped memory?
> Yes.
>
>> which kernel version do you use?
> 3.9.
Still using 3.2, but strange enough the above example is also not increasing RSS size for me.

Can you try the following:
qemu git master with 1G of memory (hanging in bios with no boot device) and migrate it. Before migration RSS Size os somewhat
around 16MB. After migration its RSS size is in the order of 1G.

Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-04 14:40                   ` Paolo Bonzini
  2013-06-04 14:48                     ` Peter Lieven
@ 2013-06-04 15:10                     ` Peter Lieven
  2013-06-08  8:27                       ` Wenchao Xia
  1 sibling, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-06-04 15:10 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

On 04.06.2013 16:40, Paolo Bonzini wrote:
> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>> out to be a bad idea.
>>>>>>
>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>> verify it fixes your problem.
>>>>>>
>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>> the zeroes?
>>>>> It should be enough to not write them.
>>>> Problem: checking the pages for zero allocates them. even at the source.
>>> It doesn't look like.  I tried this program and top doesn't show an
>>> increasing amount of reserved memory:
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> int main()
>>> {
>>>       char *x = malloc(500 << 20);
>>>       int i, j;
>>>       for (i = 0; i < 500; i += 10) {
>>>           for (j = 0; j < 10 << 20; j += 4096) {
>>>                *(volatile char*) (x + (i << 20) + j);
>>>           }
>>>           getchar();
>>>       }
>>> }
>> strange. we are talking about RSS size, right?
> None of the three top values change, and only VIRT is >500 MB.
>
>> is the malloc above using mmapped memory?
> Yes.
>
>> which kernel version do you use?
> 3.9.
>
>> what avoids allocating the memory for me is the following (with
>> whatever side effects it has ;-))
> This would also fail to migrate any page that is swapped out, breaking
> overcommit in a more subtle way. :)
>
> Paolo
the following does also not allocate memory, but qemu does...

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <unistd.h>
#include <sys/resource.h>
#include <inttypes.h>
#include <string.h>
#include <sys/mman.h>
#include <errno.h>

#if defined __SSE2__
#include <emmintrin.h>
#define VECTYPE        __m128i
#define SPLAT(p)       _mm_set1_epi8(*(p))
#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
#else
#define VECTYPE        unsigned long
#define SPLAT(p)       (*(p) * (~0UL / 255))
#define ALL_EQ(v1, v2) ((v1) == (v2))
#endif

#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8

/* Round number down to multiple */
#define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))

/* Round number up to multiple */
#define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))

#define QEMU_VMALLOC_ALIGN (256 * 4096)

/* alloc shared memory pages */
void *qemu_anon_ram_alloc(size_t size)
{
     size_t align = QEMU_VMALLOC_ALIGN;
     size_t total = size + align - getpagesize();
     void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
     size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;

     if (ptr == MAP_FAILED) {
         fprintf(stderr, "Failed to allocate %zu B: %s\n",
                 size, strerror(errno));
         abort();
     }

     ptr += offset;
     total -= offset;

     if (offset > 0) {
         munmap(ptr - offset, offset);
     }
     if (total > size) {
         munmap(ptr + size, total - size);
     }

     return ptr;
}

static inline int
can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
{
     return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
                    * sizeof(VECTYPE)) == 0
             && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
}

size_t buffer_find_nonzero_offset(const void *buf, size_t len)
{
     const VECTYPE *p = buf;
     const VECTYPE zero = (VECTYPE){0};
     size_t i;

     if (!len) {
         return 0;
     }

     assert(can_use_buffer_find_nonzero_offset(buf, len));

     for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
         if (!ALL_EQ(p[i], zero)) {
             return i * sizeof(VECTYPE);
         }
     }

     for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
          i < len / sizeof(VECTYPE);
          i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
         VECTYPE tmp0 = p[i + 0] | p[i + 1];
         VECTYPE tmp1 = p[i + 2] | p[i + 3];
         VECTYPE tmp2 = p[i + 4] | p[i + 5];
         VECTYPE tmp3 = p[i + 6] | p[i + 7];
         VECTYPE tmp01 = tmp0 | tmp1;
         VECTYPE tmp23 = tmp2 | tmp3;
         if (!ALL_EQ(tmp01 | tmp23, zero)) {
             break;
         }
     }

     return i * sizeof(VECTYPE);
}

int main()
{
      //char *x = malloc(1024 << 20);
      char *x = qemu_anon_ram_alloc(1024 << 20);

      int i, j;
      int ret = 0;
      struct rusage rusage;
      for (i = 0; i < 500; i ++) {
          for (j = 0; j < 10 << 20; j += 4096) {
               ret += buffer_find_nonzero_offset((char*) (x + (i << 20) + j), 4096);
          }
          getrusage( RUSAGE_SELF, &rusage );
          printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10), rusage.ru_maxrss);
          getchar();
      }
      printf("%d zero pages\n", ret);
}

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-04 14:48                     ` Peter Lieven
@ 2013-06-04 15:17                       ` Paolo Bonzini
  2013-06-04 19:15                         ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Paolo Bonzini @ 2013-06-04 15:17 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

Il 04/06/2013 16:48, Peter Lieven ha scritto:
> Still using 3.2, but strange enough the above example is also not
> increasing RSS size for me.
> 
> Can you try the following:
> qemu git master with 1G of memory (hanging in bios with no boot device)
> and migrate it. Before migration RSS Size os somewhat
> around 16MB. After migration its RSS size is in the order of 1G.

That may be a kernel bug.  The kernel did not do the copy-on-write trick
on huge zero pages.  It was fixed last year, maybe 3.2 is not enough.
Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-04 15:17                       ` Paolo Bonzini
@ 2013-06-04 19:15                         ` Peter Lieven
  2013-06-05  3:37                           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-06-04 19:15 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson


Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 04/06/2013 16:48, Peter Lieven ha scritto:
>> Still using 3.2, but strange enough the above example is also not
>> increasing RSS size for me.
>> 
>> Can you try the following:
>> qemu git master with 1G of memory (hanging in bios with no boot device)
>> and migrate it. Before migration RSS Size os somewhat
>> around 16MB. After migration its RSS size is in the order of 1G.
> 
> That may be a kernel bug.  The kernel did not do the copy-on-write trick
> on huge zero pages.  It was fixed last year, maybe 3.2 is not enough.
> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces.

that's it. thanks for the pointer. the huge zero page was introduced in 3.8.

paolo, alexey: can you please verify the following works for you:
https://github.com/plieven/qemu/tree/fix-migration

thanks
peter


> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-04 19:15                         ` Peter Lieven
@ 2013-06-05  3:37                           ` Alexey Kardashevskiy
  2013-06-05  6:09                             ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-05  3:37 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson

On 06/05/2013 05:15 AM, Peter Lieven wrote:
> 
> Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> 
>> Il 04/06/2013 16:48, Peter Lieven ha scritto:
>>> Still using 3.2, but strange enough the above example is also not
>>> increasing RSS size for me.
>>>
>>> Can you try the following:
>>> qemu git master with 1G of memory (hanging in bios with no boot device)
>>> and migrate it. Before migration RSS Size os somewhat
>>> around 16MB. After migration its RSS size is in the order of 1G.
>>
>> That may be a kernel bug.  The kernel did not do the copy-on-write trick
>> on huge zero pages.  It was fixed last year, maybe 3.2 is not enough.
>> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces.
> 
> that's it. thanks for the pointer. the huge zero page was introduced in 3.8.
> 
> paolo, alexey: can you please verify the following works for you:
> https://github.com/plieven/qemu/tree/fix-migration

These two?
848b796 Tue Jun 4 14:43:04 2013 +0200 Peter Lieven migration: do not
overwrite zero pages
2206ac8 Tue Jun 4 14:25:33 2013 +0200 Peter Lieven Revert "migration: do
not sent zero pages in bulk stage"

That works for me (qemu 1.5, kernel 3.9-rc2).
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>



-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-05  3:37                           ` Alexey Kardashevskiy
@ 2013-06-05  6:09                             ` Peter Lieven
  2013-06-09  4:12                               ` liu ping fan
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-06-05  6:09 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson


Am 05.06.2013 um 05:37 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:

> On 06/05/2013 05:15 AM, Peter Lieven wrote:
>> 
>> Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>> 
>>> Il 04/06/2013 16:48, Peter Lieven ha scritto:
>>>> Still using 3.2, but strange enough the above example is also not
>>>> increasing RSS size for me.
>>>> 
>>>> Can you try the following:
>>>> qemu git master with 1G of memory (hanging in bios with no boot device)
>>>> and migrate it. Before migration RSS Size os somewhat
>>>> around 16MB. After migration its RSS size is in the order of 1G.
>>> 
>>> That may be a kernel bug.  The kernel did not do the copy-on-write trick
>>> on huge zero pages.  It was fixed last year, maybe 3.2 is not enough.
>>> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces.
>> 
>> that's it. thanks for the pointer. the huge zero page was introduced in 3.8.
>> 
>> paolo, alexey: can you please verify the following works for you:
>> https://github.com/plieven/qemu/tree/fix-migration
> 
> These two?
> 848b796 Tue Jun 4 14:43:04 2013 +0200 Peter Lieven migration: do not
> overwrite zero pages
> 2206ac8 Tue Jun 4 14:25:33 2013 +0200 Peter Lieven Revert "migration: do
> not sent zero pages in bulk stage"

Yes, sorry forgot to mention this.

> 
> That works for me (qemu 1.5, kernel 3.9-rc2).
> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Thank you,
Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-05-30 13:00       ` Paolo Bonzini
  2013-05-30 13:38         ` Alexey Kardashevskiy
  2013-05-30 14:38         ` Peter Lieven
@ 2013-06-08  8:24         ` Wenchao Xia
  2 siblings, 0 replies; 49+ messages in thread
From: Wenchao Xia @ 2013-06-08  8:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alexey Kardashevskiy, qemu-ppc, Peter Lieven, qemu-devel, David Gibson

于 2013-5-30 21:00, Paolo Bonzini 写道:
> Il 30/05/2013 11:08, Peter Lieven ha scritto:
>> Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy:
>>> On 05/30/2013 05:49 PM, Paolo Bonzini wrote:
>>>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto:
>>>>> Hi!
>>>>>
>>>>> I found the migration broken on pseries platform, specifically, this patch
>>>>> broke it:
>>>>>
>>>>> f1c72795af573b24a7da5eb52375c9aba8a37972
>>>>> migration: do not sent zero pages in bulk stage
>>>>>
>>>>> The idea is not to send zero pages to the destination guest which is
>>>>> expected to have 100% empty RAM.
>>>>>
>>>>> However on pseries plaftorm the guest always has some stuff in the RAM as a
>>>>> part of initialization (device tree, system firmware and rtas (?)) so it is
>>>>> not completely empty. As the source guest cannot detect this, it skips some
>>>>> pages during migration and we get a broken destination guest. Bug.
>>>>>
>>>>> While the idea is ok in general, I do not see any easy way to fix it as
>>>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information
>>>>> about whether we are about to receive a migration or not (-incoming
>>>>> parameter) and we cannot move device-tree and system firmware
>>>>> initialization anywhere else.
>>>>>
>>>>> ram_bulk_stage is static and cannot be disabled from the platform
>>>>> initialization code.
>>>>>
>>>>> So what would the community suggest?
>>>> Revert the patch. :)
>>> I'll wait for 24 hours (forgot to cc: the author) and then post a revert
>>> patch :)
>>>
>> does this problem only occur on pseries emulation?
>
> Probably not.  On a PC, it would occur if you had 4K of zeros in the
> source BIOS but not in the destination BIOS.  When you reboot, the BIOS
> image is wrong.
>
   So one way is to migrate BIOS contents together?

>> not sending zero pages is not only a performance benefit it also makes
>> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously
>> and memory is not available immediately.
>
> You could also scan the page for nonzero values before writing it.
>
> Paolo
>


-- 
Best Regards

Wenchao Xia

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-04 15:10                     ` Peter Lieven
@ 2013-06-08  8:27                       ` Wenchao Xia
  2013-06-08  8:30                         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 49+ messages in thread
From: Wenchao Xia @ 2013-06-08  8:27 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Alexey Kardashevskiy, Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson

 > On 04.06.2013 16:40, Paolo Bonzini wrote:
>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>> out to be a bad idea.
>>>>>>>
>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>> verify it fixes your problem.
>>>>>>>
>>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>>> the zeroes?
>>>>>> It should be enough to not write them.
>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>> source.
>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>> increasing amount of reserved memory:
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> int main()
>>>> {
>>>>       char *x = malloc(500 << 20);
>>>>       int i, j;
>>>>       for (i = 0; i < 500; i += 10) {
>>>>           for (j = 0; j < 10 << 20; j += 4096) {
>>>>                *(volatile char*) (x + (i << 20) + j);
>>>>           }
>>>>           getchar();
>>>>       }
>>>> }
>>> strange. we are talking about RSS size, right?
>> None of the three top values change, and only VIRT is >500 MB.
>>
>>> is the malloc above using mmapped memory?
>> Yes.
>>
>>> which kernel version do you use?
>> 3.9.
>>
>>> what avoids allocating the memory for me is the following (with
>>> whatever side effects it has ;-))
>> This would also fail to migrate any page that is swapped out, breaking
>> overcommit in a more subtle way. :)
>>
>> Paolo
> the following does also not allocate memory, but qemu does...
>
Hi, Peter
   As the patch writes

"not sending zero pages breaks migration if a page is zero
at the source but not at the destination."

   I don't understand why it would be trouble, shouldn't all page
not received in dest be treated as zero pages?

Also, you mean following code is from qemu and it does not allocate
memory with you gcc right? Maybe it is related to KVM, how about
turn off KVM and retry following code in qemu?

> #include <stdio.h>
> #include <stdlib.h>
> #include <assert.h>
> #include <unistd.h>
> #include <sys/resource.h>
> #include <inttypes.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <errno.h>
>
> #if defined __SSE2__
> #include <emmintrin.h>
> #define VECTYPE        __m128i
> #define SPLAT(p)       _mm_set1_epi8(*(p))
> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
> 0xFFFF)
> #else
> #define VECTYPE        unsigned long
> #define SPLAT(p)       (*(p) * (~0UL / 255))
> #define ALL_EQ(v1, v2) ((v1) == (v2))
> #endif
>
> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>
> /* Round number down to multiple */
> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>
> /* Round number up to multiple */
> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>
> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>
> /* alloc shared memory pages */
> void *qemu_anon_ram_alloc(size_t size)
> {
>      size_t align = QEMU_VMALLOC_ALIGN;
>      size_t total = size + align - getpagesize();
>      void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>                       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>      size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
>
>      if (ptr == MAP_FAILED) {
>          fprintf(stderr, "Failed to allocate %zu B: %s\n",
>                  size, strerror(errno));
>          abort();
>      }
>
>      ptr += offset;
>      total -= offset;
>
>      if (offset > 0) {
>          munmap(ptr - offset, offset);
>      }
>      if (total > size) {
>          munmap(ptr + size, total - size);
>      }
>
>      return ptr;
> }
>
> static inline int
> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
> {
>      return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>                     * sizeof(VECTYPE)) == 0
>              && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
> }
>
> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
> {
>      const VECTYPE *p = buf;
>      const VECTYPE zero = (VECTYPE){0};
>      size_t i;
>
>      if (!len) {
>          return 0;
>      }
>
>      assert(can_use_buffer_find_nonzero_offset(buf, len));
>
>      for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>          if (!ALL_EQ(p[i], zero)) {
>              return i * sizeof(VECTYPE);
>          }
>      }
>
>      for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>           i < len / sizeof(VECTYPE);
>           i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>          VECTYPE tmp0 = p[i + 0] | p[i + 1];
>          VECTYPE tmp1 = p[i + 2] | p[i + 3];
>          VECTYPE tmp2 = p[i + 4] | p[i + 5];
>          VECTYPE tmp3 = p[i + 6] | p[i + 7];
>          VECTYPE tmp01 = tmp0 | tmp1;
>          VECTYPE tmp23 = tmp2 | tmp3;
>          if (!ALL_EQ(tmp01 | tmp23, zero)) {
>              break;
>          }
>      }
>
>      return i * sizeof(VECTYPE);
> }
>
> int main()
> {
>       //char *x = malloc(1024 << 20);
>       char *x = qemu_anon_ram_alloc(1024 << 20);
>
>       int i, j;
>       int ret = 0;
>       struct rusage rusage;
>       for (i = 0; i < 500; i ++) {
>           for (j = 0; j < 10 << 20; j += 4096) {
>                ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
> + j), 4096);
>           }
>           getrusage( RUSAGE_SELF, &rusage );
>           printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
> rusage.ru_maxrss);
>           getchar();
>       }
>       printf("%d zero pages\n", ret);
> }
>


-- 
Best Regards

Wenchao Xia

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-08  8:27                       ` Wenchao Xia
@ 2013-06-08  8:30                         ` Alexey Kardashevskiy
  2013-06-09  2:16                           ` Wenchao Xia
  0 siblings, 1 reply; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-08  8:30 UTC (permalink / raw)
  To: Wenchao Xia
  Cc: qemu-ppc, Paolo Bonzini, Peter Lieven, qemu-devel, David Gibson

On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>> out to be a bad idea.
>>>>>>>>
>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>> verify it fixes your problem.
>>>>>>>>
>>>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>>>> the zeroes?
>>>>>>> It should be enough to not write them.
>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>> source.
>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>> increasing amount of reserved memory:
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> int main()
>>>>> {
>>>>>       char *x = malloc(500 << 20);
>>>>>       int i, j;
>>>>>       for (i = 0; i < 500; i += 10) {
>>>>>           for (j = 0; j < 10 << 20; j += 4096) {
>>>>>                *(volatile char*) (x + (i << 20) + j);
>>>>>           }
>>>>>           getchar();
>>>>>       }
>>>>> }
>>>> strange. we are talking about RSS size, right?
>>> None of the three top values change, and only VIRT is >500 MB.
>>>
>>>> is the malloc above using mmapped memory?
>>> Yes.
>>>
>>>> which kernel version do you use?
>>> 3.9.
>>>
>>>> what avoids allocating the memory for me is the following (with
>>>> whatever side effects it has ;-))
>>> This would also fail to migrate any page that is swapped out, breaking
>>> overcommit in a more subtle way. :)
>>>
>>> Paolo
>> the following does also not allocate memory, but qemu does...
>>
> Hi, Peter
>   As the patch writes
> 
> "not sending zero pages breaks migration if a page is zero
> at the source but not at the destination."
> 
>   I don't understand why it would be trouble, shouldn't all page
> not received in dest be treated as zero pages?


How would the destination guest know if some page must be cleared? The
previous patch (which Peter reverted) did not send anything for the pages
which were zero on the source side.



> Also, you mean following code is from qemu and it does not allocate
> memory with you gcc right? Maybe it is related to KVM, how about
> turn off KVM and retry following code in qemu?
> 
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <assert.h>
>> #include <unistd.h>
>> #include <sys/resource.h>
>> #include <inttypes.h>
>> #include <string.h>
>> #include <sys/mman.h>
>> #include <errno.h>
>>
>> #if defined __SSE2__
>> #include <emmintrin.h>
>> #define VECTYPE        __m128i
>> #define SPLAT(p)       _mm_set1_epi8(*(p))
>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>> 0xFFFF)
>> #else
>> #define VECTYPE        unsigned long
>> #define SPLAT(p)       (*(p) * (~0UL / 255))
>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>> #endif
>>
>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>
>> /* Round number down to multiple */
>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>
>> /* Round number up to multiple */
>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>>
>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>>
>> /* alloc shared memory pages */
>> void *qemu_anon_ram_alloc(size_t size)
>> {
>>      size_t align = QEMU_VMALLOC_ALIGN;
>>      size_t total = size + align - getpagesize();
>>      void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>>                       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>      size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
>>
>>      if (ptr == MAP_FAILED) {
>>          fprintf(stderr, "Failed to allocate %zu B: %s\n",
>>                  size, strerror(errno));
>>          abort();
>>      }
>>
>>      ptr += offset;
>>      total -= offset;
>>
>>      if (offset > 0) {
>>          munmap(ptr - offset, offset);
>>      }
>>      if (total > size) {
>>          munmap(ptr + size, total - size);
>>      }
>>
>>      return ptr;
>> }
>>
>> static inline int
>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>> {
>>      return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>                     * sizeof(VECTYPE)) == 0
>>              && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>> }
>>
>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>> {
>>      const VECTYPE *p = buf;
>>      const VECTYPE zero = (VECTYPE){0};
>>      size_t i;
>>
>>      if (!len) {
>>          return 0;
>>      }
>>
>>      assert(can_use_buffer_find_nonzero_offset(buf, len));
>>
>>      for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>          if (!ALL_EQ(p[i], zero)) {
>>              return i * sizeof(VECTYPE);
>>          }
>>      }
>>
>>      for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>           i < len / sizeof(VECTYPE);
>>           i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>          VECTYPE tmp0 = p[i + 0] | p[i + 1];
>>          VECTYPE tmp1 = p[i + 2] | p[i + 3];
>>          VECTYPE tmp2 = p[i + 4] | p[i + 5];
>>          VECTYPE tmp3 = p[i + 6] | p[i + 7];
>>          VECTYPE tmp01 = tmp0 | tmp1;
>>          VECTYPE tmp23 = tmp2 | tmp3;
>>          if (!ALL_EQ(tmp01 | tmp23, zero)) {
>>              break;
>>          }
>>      }
>>
>>      return i * sizeof(VECTYPE);
>> }
>>
>> int main()
>> {
>>       //char *x = malloc(1024 << 20);
>>       char *x = qemu_anon_ram_alloc(1024 << 20);
>>
>>       int i, j;
>>       int ret = 0;
>>       struct rusage rusage;
>>       for (i = 0; i < 500; i ++) {
>>           for (j = 0; j < 10 << 20; j += 4096) {
>>                ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
>> + j), 4096);
>>           }
>>           getrusage( RUSAGE_SELF, &rusage );
>>           printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
>> rusage.ru_maxrss);
>>           getchar();
>>       }
>>       printf("%d zero pages\n", ret);
>> }
>>
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-08  8:30                         ` Alexey Kardashevskiy
@ 2013-06-09  2:16                           ` Wenchao Xia
  2013-06-09  2:34                             ` Alexey Kardashevskiy
  2013-06-09  2:53                             ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 49+ messages in thread
From: Wenchao Xia @ 2013-06-09  2:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-ppc, Paolo Bonzini, Peter Lieven, qemu-devel, David Gibson

于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>>> out to be a bad idea.
>>>>>>>>>
>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>> verify it fixes your problem.
>>>>>>>>>
>>>>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>>>>> the zeroes?
>>>>>>>> It should be enough to not write them.
>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>> source.
>>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>>> increasing amount of reserved memory:
>>>>>>
>>>>>> #include <stdio.h>
>>>>>> #include <stdlib.h>
>>>>>> int main()
>>>>>> {
>>>>>>        char *x = malloc(500 << 20);
>>>>>>        int i, j;
>>>>>>        for (i = 0; i < 500; i += 10) {
>>>>>>            for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>                 *(volatile char*) (x + (i << 20) + j);
>>>>>>            }
>>>>>>            getchar();
>>>>>>        }
>>>>>> }
>>>>> strange. we are talking about RSS size, right?
>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>
>>>>> is the malloc above using mmapped memory?
>>>> Yes.
>>>>
>>>>> which kernel version do you use?
>>>> 3.9.
>>>>
>>>>> what avoids allocating the memory for me is the following (with
>>>>> whatever side effects it has ;-))
>>>> This would also fail to migrate any page that is swapped out, breaking
>>>> overcommit in a more subtle way. :)
>>>>
>>>> Paolo
>>> the following does also not allocate memory, but qemu does...
>>>
>> Hi, Peter
>>    As the patch writes
>>
>> "not sending zero pages breaks migration if a page is zero
>> at the source but not at the destination."
>>
>>    I don't understand why it would be trouble, shouldn't all page
>> not received in dest be treated as zero pages?
>
>
> How would the destination guest know if some page must be cleared? The
> previous patch (which Peter reverted) did not send anything for the pages
> which were zero on the source side.
>
>
   If an page was not received and destination knows that page should
exist according to total size, fill it with zero at destination, would
it solve the problem?

>
>> Also, you mean following code is from qemu and it does not allocate
>> memory with you gcc right? Maybe it is related to KVM, how about
>> turn off KVM and retry following code in qemu?
>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <assert.h>
>>> #include <unistd.h>
>>> #include <sys/resource.h>
>>> #include <inttypes.h>
>>> #include <string.h>
>>> #include <sys/mman.h>
>>> #include <errno.h>
>>>
>>> #if defined __SSE2__
>>> #include <emmintrin.h>
>>> #define VECTYPE        __m128i
>>> #define SPLAT(p)       _mm_set1_epi8(*(p))
>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>> 0xFFFF)
>>> #else
>>> #define VECTYPE        unsigned long
>>> #define SPLAT(p)       (*(p) * (~0UL / 255))
>>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>>> #endif
>>>
>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>>
>>> /* Round number down to multiple */
>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>>
>>> /* Round number up to multiple */
>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>>>
>>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>>>
>>> /* alloc shared memory pages */
>>> void *qemu_anon_ram_alloc(size_t size)
>>> {
>>>       size_t align = QEMU_VMALLOC_ALIGN;
>>>       size_t total = size + align - getpagesize();
>>>       void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>>>                        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>       size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
>>>
>>>       if (ptr == MAP_FAILED) {
>>>           fprintf(stderr, "Failed to allocate %zu B: %s\n",
>>>                   size, strerror(errno));
>>>           abort();
>>>       }
>>>
>>>       ptr += offset;
>>>       total -= offset;
>>>
>>>       if (offset > 0) {
>>>           munmap(ptr - offset, offset);
>>>       }
>>>       if (total > size) {
>>>           munmap(ptr + size, total - size);
>>>       }
>>>
>>>       return ptr;
>>> }
>>>
>>> static inline int
>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>>> {
>>>       return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>>                      * sizeof(VECTYPE)) == 0
>>>               && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>>> }
>>>
>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>> {
>>>       const VECTYPE *p = buf;
>>>       const VECTYPE zero = (VECTYPE){0};
>>>       size_t i;
>>>
>>>       if (!len) {
>>>           return 0;
>>>       }
>>>
>>>       assert(can_use_buffer_find_nonzero_offset(buf, len));
>>>
>>>       for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>>           if (!ALL_EQ(p[i], zero)) {
>>>               return i * sizeof(VECTYPE);
>>>           }
>>>       }
>>>
>>>       for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>>            i < len / sizeof(VECTYPE);
>>>            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>>           VECTYPE tmp0 = p[i + 0] | p[i + 1];
>>>           VECTYPE tmp1 = p[i + 2] | p[i + 3];
>>>           VECTYPE tmp2 = p[i + 4] | p[i + 5];
>>>           VECTYPE tmp3 = p[i + 6] | p[i + 7];
>>>           VECTYPE tmp01 = tmp0 | tmp1;
>>>           VECTYPE tmp23 = tmp2 | tmp3;
>>>           if (!ALL_EQ(tmp01 | tmp23, zero)) {
>>>               break;
>>>           }
>>>       }
>>>
>>>       return i * sizeof(VECTYPE);
>>> }
>>>
>>> int main()
>>> {
>>>        //char *x = malloc(1024 << 20);
>>>        char *x = qemu_anon_ram_alloc(1024 << 20);
>>>
>>>        int i, j;
>>>        int ret = 0;
>>>        struct rusage rusage;
>>>        for (i = 0; i < 500; i ++) {
>>>            for (j = 0; j < 10 << 20; j += 4096) {
>>>                 ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
>>> + j), 4096);
>>>            }
>>>            getrusage( RUSAGE_SELF, &rusage );
>>>            printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
>>> rusage.ru_maxrss);
>>>            getchar();
>>>        }
>>>        printf("%d zero pages\n", ret);
>>> }
>>>
>>
>>
>
>


-- 
Best Regards

Wenchao Xia

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-09  2:16                           ` Wenchao Xia
@ 2013-06-09  2:34                             ` Alexey Kardashevskiy
  2013-06-09  2:52                               ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt
  2013-06-09  3:01                               ` [Qemu-devel] " Wenchao Xia
  2013-06-09  2:53                             ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-09  2:34 UTC (permalink / raw)
  To: Wenchao Xia
  Cc: qemu-ppc, Paolo Bonzini, Peter Lieven, qemu-devel, David Gibson

On 06/09/2013 12:16 PM, Wenchao Xia wrote:
> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>>>> out to be a bad idea.
>>>>>>>>>>
>>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>>> verify it fixes your problem.
>>>>>>>>>>
>>>>>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>>>>>> the zeroes?
>>>>>>>>> It should be enough to not write them.
>>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>>> source.
>>>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>>>> increasing amount of reserved memory:
>>>>>>>
>>>>>>> #include <stdio.h>
>>>>>>> #include <stdlib.h>
>>>>>>> int main()
>>>>>>> {
>>>>>>>        char *x = malloc(500 << 20);
>>>>>>>        int i, j;
>>>>>>>        for (i = 0; i < 500; i += 10) {
>>>>>>>            for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>>                 *(volatile char*) (x + (i << 20) + j);
>>>>>>>            }
>>>>>>>            getchar();
>>>>>>>        }
>>>>>>> }
>>>>>> strange. we are talking about RSS size, right?
>>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>>
>>>>>> is the malloc above using mmapped memory?
>>>>> Yes.
>>>>>
>>>>>> which kernel version do you use?
>>>>> 3.9.
>>>>>
>>>>>> what avoids allocating the memory for me is the following (with
>>>>>> whatever side effects it has ;-))
>>>>> This would also fail to migrate any page that is swapped out, breaking
>>>>> overcommit in a more subtle way. :)
>>>>>
>>>>> Paolo
>>>> the following does also not allocate memory, but qemu does...
>>>>
>>> Hi, Peter
>>>    As the patch writes
>>>
>>> "not sending zero pages breaks migration if a page is zero
>>> at the source but not at the destination."
>>>
>>>    I don't understand why it would be trouble, shouldn't all page
>>> not received in dest be treated as zero pages?
>>
>>
>> How would the destination guest know if some page must be cleared? The
>> previous patch (which Peter reverted) did not send anything for the pages
>> which were zero on the source side.
>>
>>
>   If an page was not received and destination knows that page should
> exist according to total size, fill it with zero at destination, would
> it solve the problem?

It is _live_ migration, the source sends changes, same pages can change and
be sent several times. So we would need to turn tracking on on the
destination to know if some page was received from the source or changed by
the destination itself (by writing there bios/firmware images, etc) and
then clear pages which were touched by the destination and were not sent by
the source.

Or we do not make guesses, the source sends everything and the destination
simply checks if a page which is empty on the source is empty on the
destination and avoid writing zeroes to it. Looks simpler to me and this is
what the new patch does.


> 
>>
>>> Also, you mean following code is from qemu and it does not allocate
>>> memory with you gcc right? Maybe it is related to KVM, how about
>>> turn off KVM and retry following code in qemu?
>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <assert.h>
>>>> #include <unistd.h>
>>>> #include <sys/resource.h>
>>>> #include <inttypes.h>
>>>> #include <string.h>
>>>> #include <sys/mman.h>
>>>> #include <errno.h>
>>>>
>>>> #if defined __SSE2__
>>>> #include <emmintrin.h>
>>>> #define VECTYPE        __m128i
>>>> #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>>> 0xFFFF)
>>>> #else
>>>> #define VECTYPE        unsigned long
>>>> #define SPLAT(p)       (*(p) * (~0UL / 255))
>>>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>>>> #endif
>>>>
>>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>>>
>>>> /* Round number down to multiple */
>>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>>>
>>>> /* Round number up to multiple */
>>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>>>>
>>>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>>>>
>>>> /* alloc shared memory pages */
>>>> void *qemu_anon_ram_alloc(size_t size)
>>>> {
>>>>       size_t align = QEMU_VMALLOC_ALIGN;
>>>>       size_t total = size + align - getpagesize();
>>>>       void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>>>>                        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>>       size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) -
>>>> (uintptr_t)ptr;
>>>>
>>>>       if (ptr == MAP_FAILED) {
>>>>           fprintf(stderr, "Failed to allocate %zu B: %s\n",
>>>>                   size, strerror(errno));
>>>>           abort();
>>>>       }
>>>>
>>>>       ptr += offset;
>>>>       total -= offset;
>>>>
>>>>       if (offset > 0) {
>>>>           munmap(ptr - offset, offset);
>>>>       }
>>>>       if (total > size) {
>>>>           munmap(ptr + size, total - size);
>>>>       }
>>>>
>>>>       return ptr;
>>>> }
>>>>
>>>> static inline int
>>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>>>> {
>>>>       return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>>>                      * sizeof(VECTYPE)) == 0
>>>>               && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>>>> }
>>>>
>>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>>> {
>>>>       const VECTYPE *p = buf;
>>>>       const VECTYPE zero = (VECTYPE){0};
>>>>       size_t i;
>>>>
>>>>       if (!len) {
>>>>           return 0;
>>>>       }
>>>>
>>>>       assert(can_use_buffer_find_nonzero_offset(buf, len));
>>>>
>>>>       for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>>>           if (!ALL_EQ(p[i], zero)) {
>>>>               return i * sizeof(VECTYPE);
>>>>           }
>>>>       }
>>>>
>>>>       for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>>>            i < len / sizeof(VECTYPE);
>>>>            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>>>           VECTYPE tmp0 = p[i + 0] | p[i + 1];
>>>>           VECTYPE tmp1 = p[i + 2] | p[i + 3];
>>>>           VECTYPE tmp2 = p[i + 4] | p[i + 5];
>>>>           VECTYPE tmp3 = p[i + 6] | p[i + 7];
>>>>           VECTYPE tmp01 = tmp0 | tmp1;
>>>>           VECTYPE tmp23 = tmp2 | tmp3;
>>>>           if (!ALL_EQ(tmp01 | tmp23, zero)) {
>>>>               break;
>>>>           }
>>>>       }
>>>>
>>>>       return i * sizeof(VECTYPE);
>>>> }
>>>>
>>>> int main()
>>>> {
>>>>        //char *x = malloc(1024 << 20);
>>>>        char *x = qemu_anon_ram_alloc(1024 << 20);
>>>>
>>>>        int i, j;
>>>>        int ret = 0;
>>>>        struct rusage rusage;
>>>>        for (i = 0; i < 500; i ++) {
>>>>            for (j = 0; j < 10 << 20; j += 4096) {
>>>>                 ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
>>>> + j), 4096);
>>>>            }
>>>>            getrusage( RUSAGE_SELF, &rusage );
>>>>            printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
>>>> rusage.ru_maxrss);
>>>>            getchar();
>>>>        }
>>>>        printf("%d zero pages\n", ret);
>>>> }
>>>>
>>>
>>>
>>
>>
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc]  broken incoming migration
  2013-06-09  2:34                             ` Alexey Kardashevskiy
@ 2013-06-09  2:52                               ` Benjamin Herrenschmidt
  2013-06-09  3:01                                 ` Alexey Kardashevskiy
  2013-06-09  3:01                               ` [Qemu-devel] " Wenchao Xia
  1 sibling, 1 reply; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-09  2:52 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Lieven, qemu-devel, qemu-ppc, Paolo Bonzini, Wenchao Xia,
	David Gibson

On Sun, 2013-06-09 at 12:34 +1000, Alexey Kardashevskiy wrote:

> It is _live_ migration, the source sends changes, same pages can change and
> be sent several times. So we would need to turn tracking on on the
> destination to know if some page was received from the source or changed by
> the destination itself (by writing there bios/firmware images, etc) and
> then clear pages which were touched by the destination and were not sent by
> the source.

Or we can set some kind of flag so that when creating a "migration
target" VM we don't load all these things into memory.

> Or we do not make guesses, the source sends everything and the destination
> simply checks if a page which is empty on the source is empty on the
> destination and avoid writing zeroes to it. Looks simpler to me and this is
> what the new patch does.

But you end up sending a lot of zero's ... is the migration compressed
(I am not familiar with it at all) ? If it is, that shouldn't be a big
deal, but else it feels to me that you should be able to send a special
packet instead that says "all zeros" because you'll potentially have an
awful lot of these.

Ben.

> > 
> >>
> >>> Also, you mean following code is from qemu and it does not allocate
> >>> memory with you gcc right? Maybe it is related to KVM, how about
> >>> turn off KVM and retry following code in qemu?
> >>>
> >>>> #include <stdio.h>
> >>>> #include <stdlib.h>
> >>>> #include <assert.h>
> >>>> #include <unistd.h>
> >>>> #include <sys/resource.h>
> >>>> #include <inttypes.h>
> >>>> #include <string.h>
> >>>> #include <sys/mman.h>
> >>>> #include <errno.h>
> >>>>
> >>>> #if defined __SSE2__
> >>>> #include <emmintrin.h>
> >>>> #define VECTYPE        __m128i
> >>>> #define SPLAT(p)       _mm_set1_epi8(*(p))
> >>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
> >>>> 0xFFFF)
> >>>> #else
> >>>> #define VECTYPE        unsigned long
> >>>> #define SPLAT(p)       (*(p) * (~0UL / 255))
> >>>> #define ALL_EQ(v1, v2) ((v1) == (v2))
> >>>> #endif
> >>>>
> >>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
> >>>>
> >>>> /* Round number down to multiple */
> >>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
> >>>>
> >>>> /* Round number up to multiple */
> >>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
> >>>>
> >>>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
> >>>>
> >>>> /* alloc shared memory pages */
> >>>> void *qemu_anon_ram_alloc(size_t size)
> >>>> {
> >>>>       size_t align = QEMU_VMALLOC_ALIGN;
> >>>>       size_t total = size + align - getpagesize();
> >>>>       void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
> >>>>                        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >>>>       size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) -
> >>>> (uintptr_t)ptr;
> >>>>
> >>>>       if (ptr == MAP_FAILED) {
> >>>>           fprintf(stderr, "Failed to allocate %zu B: %s\n",
> >>>>                   size, strerror(errno));
> >>>>           abort();
> >>>>       }
> >>>>
> >>>>       ptr += offset;
> >>>>       total -= offset;
> >>>>
> >>>>       if (offset > 0) {
> >>>>           munmap(ptr - offset, offset);
> >>>>       }
> >>>>       if (total > size) {
> >>>>           munmap(ptr + size, total - size);
> >>>>       }
> >>>>
> >>>>       return ptr;
> >>>> }
> >>>>
> >>>> static inline int
> >>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
> >>>> {
> >>>>       return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> >>>>                      * sizeof(VECTYPE)) == 0
> >>>>               && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
> >>>> }
> >>>>
> >>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
> >>>> {
> >>>>       const VECTYPE *p = buf;
> >>>>       const VECTYPE zero = (VECTYPE){0};
> >>>>       size_t i;
> >>>>
> >>>>       if (!len) {
> >>>>           return 0;
> >>>>       }
> >>>>
> >>>>       assert(can_use_buffer_find_nonzero_offset(buf, len));
> >>>>
> >>>>       for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
> >>>>           if (!ALL_EQ(p[i], zero)) {
> >>>>               return i * sizeof(VECTYPE);
> >>>>           }
> >>>>       }
> >>>>
> >>>>       for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
> >>>>            i < len / sizeof(VECTYPE);
> >>>>            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
> >>>>           VECTYPE tmp0 = p[i + 0] | p[i + 1];
> >>>>           VECTYPE tmp1 = p[i + 2] | p[i + 3];
> >>>>           VECTYPE tmp2 = p[i + 4] | p[i + 5];
> >>>>           VECTYPE tmp3 = p[i + 6] | p[i + 7];
> >>>>           VECTYPE tmp01 = tmp0 | tmp1;
> >>>>           VECTYPE tmp23 = tmp2 | tmp3;
> >>>>           if (!ALL_EQ(tmp01 | tmp23, zero)) {
> >>>>               break;
> >>>>           }
> >>>>       }
> >>>>
> >>>>       return i * sizeof(VECTYPE);
> >>>> }
> >>>>
> >>>> int main()
> >>>> {
> >>>>        //char *x = malloc(1024 << 20);
> >>>>        char *x = qemu_anon_ram_alloc(1024 << 20);
> >>>>
> >>>>        int i, j;
> >>>>        int ret = 0;
> >>>>        struct rusage rusage;
> >>>>        for (i = 0; i < 500; i ++) {
> >>>>            for (j = 0; j < 10 << 20; j += 4096) {
> >>>>                 ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
> >>>> + j), 4096);
> >>>>            }
> >>>>            getrusage( RUSAGE_SELF, &rusage );
> >>>>            printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
> >>>> rusage.ru_maxrss);
> >>>>            getchar();
> >>>>        }
> >>>>        printf("%d zero pages\n", ret);
> >>>> }
> >>>>
> >>>
> >>>
> >>
> >>
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc]  broken incoming migration
  2013-06-09  2:16                           ` Wenchao Xia
  2013-06-09  2:34                             ` Alexey Kardashevskiy
@ 2013-06-09  2:53                             ` Benjamin Herrenschmidt
  2013-06-12 14:00                               ` Paolo Bonzini
  1 sibling, 1 reply; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-09  2:53 UTC (permalink / raw)
  To: Wenchao Xia
  Cc: Alexey Kardashevskiy, Peter Lieven, qemu-devel, qemu-ppc,
	Paolo Bonzini, David Gibson

On Sun, 2013-06-09 at 10:16 +0800, Wenchao Xia wrote:
>    If an page was not received and destination knows that page should
> exist according to total size, fill it with zero at destination, would
> it solve the problem?

The easiest way to do that is to not write to those pages at the
destination to begin with, when initializing the VM... Is there any way
to know that a VM is being setup as a migration target or not ?

Ben.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-09  2:34                             ` Alexey Kardashevskiy
  2013-06-09  2:52                               ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt
@ 2013-06-09  3:01                               ` Wenchao Xia
  2013-06-09  3:09                                 ` Alexey Kardashevskiy
  1 sibling, 1 reply; 49+ messages in thread
From: Wenchao Xia @ 2013-06-09  3:01 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-ppc, Paolo Bonzini, Peter Lieven, qemu-devel, David Gibson

于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>>>>> out to be a bad idea.
>>>>>>>>>>>
>>>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>>>> verify it fixes your problem.
>>>>>>>>>>>
>>>>>>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>>>>>>> the zeroes?
>>>>>>>>>> It should be enough to not write them.
>>>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>>>> source.
>>>>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>>>>> increasing amount of reserved memory:
>>>>>>>>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <stdlib.h>
>>>>>>>> int main()
>>>>>>>> {
>>>>>>>>         char *x = malloc(500 << 20);
>>>>>>>>         int i, j;
>>>>>>>>         for (i = 0; i < 500; i += 10) {
>>>>>>>>             for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>>>                  *(volatile char*) (x + (i << 20) + j);
>>>>>>>>             }
>>>>>>>>             getchar();
>>>>>>>>         }
>>>>>>>> }
>>>>>>> strange. we are talking about RSS size, right?
>>>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>>>
>>>>>>> is the malloc above using mmapped memory?
>>>>>> Yes.
>>>>>>
>>>>>>> which kernel version do you use?
>>>>>> 3.9.
>>>>>>
>>>>>>> what avoids allocating the memory for me is the following (with
>>>>>>> whatever side effects it has ;-))
>>>>>> This would also fail to migrate any page that is swapped out, breaking
>>>>>> overcommit in a more subtle way. :)
>>>>>>
>>>>>> Paolo
>>>>> the following does also not allocate memory, but qemu does...
>>>>>
>>>> Hi, Peter
>>>>     As the patch writes
>>>>
>>>> "not sending zero pages breaks migration if a page is zero
>>>> at the source but not at the destination."
>>>>
>>>>     I don't understand why it would be trouble, shouldn't all page
>>>> not received in dest be treated as zero pages?
>>>
>>>
>>> How would the destination guest know if some page must be cleared? The
>>> previous patch (which Peter reverted) did not send anything for the pages
>>> which were zero on the source side.
>>>
>>>
>>    If an page was not received and destination knows that page should
>> exist according to total size, fill it with zero at destination, would
>> it solve the problem?
>
> It is _live_ migration, the source sends changes, same pages can change and
> be sent several times. So we would need to turn tracking on on the
> destination to know if some page was received from the source or changed by
> the destination itself (by writing there bios/firmware images, etc) and
> then clear pages which were touched by the destination and were not sent by
> the source.
   OK, I can understand the problem is, for example:
Destination boots up with 0x0000-0xFFFF filled with bios image.
Source forgot to send zero pages in 0x0000-0xFFFF.
After migration destination got 0x0000-0xFFFF dirty(different with
source)
   Thanks for explain.

   This seems refer to the migration protocol: how should the guest treat
unsent pages. The patch causing the problem, actually treat zero pages
as "not to sent" at source, but another half is missing: treat "not
received" as zero pages at destination. I guess if second half is added,
problem is gone:
after page transfer completed, before destination resume,
fill zero in "not received" pages.

>
> Or we do not make guesses, the source sends everything and the destination
> simply checks if a page which is empty on the source is empty on the
> destination and avoid writing zeroes to it. Looks simpler to me and this is
> what the new patch does.
>
>
>>
>>>
>>>> Also, you mean following code is from qemu and it does not allocate
>>>> memory with you gcc right? Maybe it is related to KVM, how about
>>>> turn off KVM and retry following code in qemu?
>>>>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <assert.h>
>>>>> #include <unistd.h>
>>>>> #include <sys/resource.h>
>>>>> #include <inttypes.h>
>>>>> #include <string.h>
>>>>> #include <sys/mman.h>
>>>>> #include <errno.h>
>>>>>
>>>>> #if defined __SSE2__
>>>>> #include <emmintrin.h>
>>>>> #define VECTYPE        __m128i
>>>>> #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>>>> 0xFFFF)
>>>>> #else
>>>>> #define VECTYPE        unsigned long
>>>>> #define SPLAT(p)       (*(p) * (~0UL / 255))
>>>>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>>>>> #endif
>>>>>
>>>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>>>>
>>>>> /* Round number down to multiple */
>>>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>>>>
>>>>> /* Round number up to multiple */
>>>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>>>>>
>>>>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>>>>>
>>>>> /* alloc shared memory pages */
>>>>> void *qemu_anon_ram_alloc(size_t size)
>>>>> {
>>>>>        size_t align = QEMU_VMALLOC_ALIGN;
>>>>>        size_t total = size + align - getpagesize();
>>>>>        void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>>>>>                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>>>        size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) -
>>>>> (uintptr_t)ptr;
>>>>>
>>>>>        if (ptr == MAP_FAILED) {
>>>>>            fprintf(stderr, "Failed to allocate %zu B: %s\n",
>>>>>                    size, strerror(errno));
>>>>>            abort();
>>>>>        }
>>>>>
>>>>>        ptr += offset;
>>>>>        total -= offset;
>>>>>
>>>>>        if (offset > 0) {
>>>>>            munmap(ptr - offset, offset);
>>>>>        }
>>>>>        if (total > size) {
>>>>>            munmap(ptr + size, total - size);
>>>>>        }
>>>>>
>>>>>        return ptr;
>>>>> }
>>>>>
>>>>> static inline int
>>>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>>>>> {
>>>>>        return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>>>>                       * sizeof(VECTYPE)) == 0
>>>>>                && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>>>>> }
>>>>>
>>>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>>>> {
>>>>>        const VECTYPE *p = buf;
>>>>>        const VECTYPE zero = (VECTYPE){0};
>>>>>        size_t i;
>>>>>
>>>>>        if (!len) {
>>>>>            return 0;
>>>>>        }
>>>>>
>>>>>        assert(can_use_buffer_find_nonzero_offset(buf, len));
>>>>>
>>>>>        for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>>>>            if (!ALL_EQ(p[i], zero)) {
>>>>>                return i * sizeof(VECTYPE);
>>>>>            }
>>>>>        }
>>>>>
>>>>>        for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>>>>             i < len / sizeof(VECTYPE);
>>>>>             i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>>>>            VECTYPE tmp0 = p[i + 0] | p[i + 1];
>>>>>            VECTYPE tmp1 = p[i + 2] | p[i + 3];
>>>>>            VECTYPE tmp2 = p[i + 4] | p[i + 5];
>>>>>            VECTYPE tmp3 = p[i + 6] | p[i + 7];
>>>>>            VECTYPE tmp01 = tmp0 | tmp1;
>>>>>            VECTYPE tmp23 = tmp2 | tmp3;
>>>>>            if (!ALL_EQ(tmp01 | tmp23, zero)) {
>>>>>                break;
>>>>>            }
>>>>>        }
>>>>>
>>>>>        return i * sizeof(VECTYPE);
>>>>> }
>>>>>
>>>>> int main()
>>>>> {
>>>>>         //char *x = malloc(1024 << 20);
>>>>>         char *x = qemu_anon_ram_alloc(1024 << 20);
>>>>>
>>>>>         int i, j;
>>>>>         int ret = 0;
>>>>>         struct rusage rusage;
>>>>>         for (i = 0; i < 500; i ++) {
>>>>>             for (j = 0; j < 10 << 20; j += 4096) {
>>>>>                  ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
>>>>> + j), 4096);
>>>>>             }
>>>>>             getrusage( RUSAGE_SELF, &rusage );
>>>>>             printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
>>>>> rusage.ru_maxrss);
>>>>>             getchar();
>>>>>         }
>>>>>         printf("%d zero pages\n", ret);
>>>>> }
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>


-- 
Best Regards

Wenchao Xia

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc]  broken incoming migration
  2013-06-09  2:52                               ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt
@ 2013-06-09  3:01                                 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-09  3:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Peter Lieven, qemu-devel, qemu-ppc, Paolo Bonzini, Wenchao Xia,
	David Gibson

On 06/09/2013 12:52 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2013-06-09 at 12:34 +1000, Alexey Kardashevskiy wrote:
> 
>> It is _live_ migration, the source sends changes, same pages can change and
>> be sent several times. So we would need to turn tracking on on the
>> destination to know if some page was received from the source or changed by
>> the destination itself (by writing there bios/firmware images, etc) and
>> then clear pages which were touched by the destination and were not sent by
>> the source.
> 
> Or we can set some kind of flag so that when creating a "migration
> target" VM we don't load all these things into memory.

How would we do that? The platform initialization code does not have a clue
whether it is going to receive a migrated host or not.


>> Or we do not make guesses, the source sends everything and the destination
>> simply checks if a page which is empty on the source is empty on the
>> destination and avoid writing zeroes to it. Looks simpler to me and this is
>> what the new patch does.
> 
> But you end up sending a lot of zero's ... is the migration compressed
> (I am not familiar with it at all) ? If it is, that shouldn't be a big
> deal, but else it feels to me that you should be able to send a special
> packet instead that says "all zeros" because you'll potentially have an
> awful lot of these.

It is compressed exactly as you described..


> 
> Ben.
> 
>>>
>>>>
>>>>> Also, you mean following code is from qemu and it does not allocate
>>>>> memory with you gcc right? Maybe it is related to KVM, how about
>>>>> turn off KVM and retry following code in qemu?
>>>>>
>>>>>> #include <stdio.h>
>>>>>> #include <stdlib.h>
>>>>>> #include <assert.h>
>>>>>> #include <unistd.h>
>>>>>> #include <sys/resource.h>
>>>>>> #include <inttypes.h>
>>>>>> #include <string.h>
>>>>>> #include <sys/mman.h>
>>>>>> #include <errno.h>
>>>>>>
>>>>>> #if defined __SSE2__
>>>>>> #include <emmintrin.h>
>>>>>> #define VECTYPE        __m128i
>>>>>> #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>>>>> 0xFFFF)
>>>>>> #else
>>>>>> #define VECTYPE        unsigned long
>>>>>> #define SPLAT(p)       (*(p) * (~0UL / 255))
>>>>>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>>>>>> #endif
>>>>>>
>>>>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>>>>>
>>>>>> /* Round number down to multiple */
>>>>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>>>>>
>>>>>> /* Round number up to multiple */
>>>>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>>>>>>
>>>>>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>>>>>>
>>>>>> /* alloc shared memory pages */
>>>>>> void *qemu_anon_ram_alloc(size_t size)
>>>>>> {
>>>>>>       size_t align = QEMU_VMALLOC_ALIGN;
>>>>>>       size_t total = size + align - getpagesize();
>>>>>>       void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>>>>>>                        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>>>>       size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) -
>>>>>> (uintptr_t)ptr;
>>>>>>
>>>>>>       if (ptr == MAP_FAILED) {
>>>>>>           fprintf(stderr, "Failed to allocate %zu B: %s\n",
>>>>>>                   size, strerror(errno));
>>>>>>           abort();
>>>>>>       }
>>>>>>
>>>>>>       ptr += offset;
>>>>>>       total -= offset;
>>>>>>
>>>>>>       if (offset > 0) {
>>>>>>           munmap(ptr - offset, offset);
>>>>>>       }
>>>>>>       if (total > size) {
>>>>>>           munmap(ptr + size, total - size);
>>>>>>       }
>>>>>>
>>>>>>       return ptr;
>>>>>> }
>>>>>>
>>>>>> static inline int
>>>>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>>>>>> {
>>>>>>       return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>>>>>                      * sizeof(VECTYPE)) == 0
>>>>>>               && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>>>>>> }
>>>>>>
>>>>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>>>>> {
>>>>>>       const VECTYPE *p = buf;
>>>>>>       const VECTYPE zero = (VECTYPE){0};
>>>>>>       size_t i;
>>>>>>
>>>>>>       if (!len) {
>>>>>>           return 0;
>>>>>>       }
>>>>>>
>>>>>>       assert(can_use_buffer_find_nonzero_offset(buf, len));
>>>>>>
>>>>>>       for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>>>>>           if (!ALL_EQ(p[i], zero)) {
>>>>>>               return i * sizeof(VECTYPE);
>>>>>>           }
>>>>>>       }
>>>>>>
>>>>>>       for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>>>>>            i < len / sizeof(VECTYPE);
>>>>>>            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>>>>>           VECTYPE tmp0 = p[i + 0] | p[i + 1];
>>>>>>           VECTYPE tmp1 = p[i + 2] | p[i + 3];
>>>>>>           VECTYPE tmp2 = p[i + 4] | p[i + 5];
>>>>>>           VECTYPE tmp3 = p[i + 6] | p[i + 7];
>>>>>>           VECTYPE tmp01 = tmp0 | tmp1;
>>>>>>           VECTYPE tmp23 = tmp2 | tmp3;
>>>>>>           if (!ALL_EQ(tmp01 | tmp23, zero)) {
>>>>>>               break;
>>>>>>           }
>>>>>>       }
>>>>>>
>>>>>>       return i * sizeof(VECTYPE);
>>>>>> }
>>>>>>
>>>>>> int main()
>>>>>> {
>>>>>>        //char *x = malloc(1024 << 20);
>>>>>>        char *x = qemu_anon_ram_alloc(1024 << 20);
>>>>>>
>>>>>>        int i, j;
>>>>>>        int ret = 0;
>>>>>>        struct rusage rusage;
>>>>>>        for (i = 0; i < 500; i ++) {
>>>>>>            for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>                 ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
>>>>>> + j), 4096);
>>>>>>            }
>>>>>>            getrusage( RUSAGE_SELF, &rusage );
>>>>>>            printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
>>>>>> rusage.ru_maxrss);
>>>>>>            getchar();
>>>>>>        }
>>>>>>        printf("%d zero pages\n", ret);
>>>>>> }
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-09  3:01                               ` [Qemu-devel] " Wenchao Xia
@ 2013-06-09  3:09                                 ` Alexey Kardashevskiy
  2013-06-09  3:31                                   ` Wenchao Xia
  2013-06-09  7:27                                   ` Peter Lieven
  0 siblings, 2 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-09  3:09 UTC (permalink / raw)
  To: Wenchao Xia
  Cc: qemu-ppc, Paolo Bonzini, Peter Lieven, qemu-devel, David Gibson

On 06/09/2013 01:01 PM, Wenchao Xia wrote:
> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>> You could also scan the page for nonzero values before
>>>>>>>>>>>>>> writing it.
>>>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>>>>>> out to be a bad idea.
>>>>>>>>>>>>
>>>>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>>>>> verify it fixes your problem.
>>>>>>>>>>>>
>>>>>>>>>>>> paolo: would we still need the madvise or is it enough to not
>>>>>>>>>>>> write
>>>>>>>>>>>> the zeroes?
>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>>>>> source.
>>>>>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>>>>>> increasing amount of reserved memory:
>>>>>>>>>
>>>>>>>>> #include <stdio.h>
>>>>>>>>> #include <stdlib.h>
>>>>>>>>> int main()
>>>>>>>>> {
>>>>>>>>>         char *x = malloc(500 << 20);
>>>>>>>>>         int i, j;
>>>>>>>>>         for (i = 0; i < 500; i += 10) {
>>>>>>>>>             for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>>>>                  *(volatile char*) (x + (i << 20) + j);
>>>>>>>>>             }
>>>>>>>>>             getchar();
>>>>>>>>>         }
>>>>>>>>> }
>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>>>>
>>>>>>>> is the malloc above using mmapped memory?
>>>>>>> Yes.
>>>>>>>
>>>>>>>> which kernel version do you use?
>>>>>>> 3.9.
>>>>>>>
>>>>>>>> what avoids allocating the memory for me is the following (with
>>>>>>>> whatever side effects it has ;-))
>>>>>>> This would also fail to migrate any page that is swapped out, breaking
>>>>>>> overcommit in a more subtle way. :)
>>>>>>>
>>>>>>> Paolo
>>>>>> the following does also not allocate memory, but qemu does...
>>>>>>
>>>>> Hi, Peter
>>>>>     As the patch writes
>>>>>
>>>>> "not sending zero pages breaks migration if a page is zero
>>>>> at the source but not at the destination."
>>>>>
>>>>>     I don't understand why it would be trouble, shouldn't all page
>>>>> not received in dest be treated as zero pages?
>>>>
>>>>
>>>> How would the destination guest know if some page must be cleared? The
>>>> previous patch (which Peter reverted) did not send anything for the pages
>>>> which were zero on the source side.
>>>>
>>>>
>>>    If an page was not received and destination knows that page should
>>> exist according to total size, fill it with zero at destination, would
>>> it solve the problem?
>>
>> It is _live_ migration, the source sends changes, same pages can change and
>> be sent several times. So we would need to turn tracking on on the
>> destination to know if some page was received from the source or changed by
>> the destination itself (by writing there bios/firmware images, etc) and
>> then clear pages which were touched by the destination and were not sent by
>> the source.
>   OK, I can understand the problem is, for example:
> Destination boots up with 0x0000-0xFFFF filled with bios image.
> Source forgot to send zero pages in 0x0000-0xFFFF.


The source did not forget, instead it zeroed these pages during its life
and thought that they must be zeroed at the destination already (as the
destination did not start and did not have a chance to write something there).


> After migration destination got 0x0000-0xFFFF dirty(different with
> source)

Yep. And those pages were empty on the source what made debugging very easy :)


>   Thanks for explain.
> 
>   This seems refer to the migration protocol: how should the guest treat
> unsent pages. The patch causing the problem, actually treat zero pages
> as "not to sent" at source, but another half is missing: treat "not
> received" as zero pages at destination. I guess if second half is added,
> problem is gone:
> after page transfer completed, before destination resume,
> fill zero in "not received" pages.



Make a working patch, we'll discuss it :) I do not see much acceleration
coming from there.


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-09  3:09                                 ` Alexey Kardashevskiy
@ 2013-06-09  3:31                                   ` Wenchao Xia
  2013-06-09  7:27                                   ` Peter Lieven
  1 sibling, 0 replies; 49+ messages in thread
From: Wenchao Xia @ 2013-06-09  3:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-ppc, Paolo Bonzini, Peter Lieven, qemu-devel, David Gibson

于 2013-6-9 11:09, Alexey Kardashevskiy 写道:
> On 06/09/2013 01:01 PM, Wenchao Xia wrote:
>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>>> You could also scan the page for nonzero values before
>>>>>>>>>>>>>>> writing it.
>>>>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>>>>>>> out to be a bad idea.
>>>>>>>>>>>>>
>>>>>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>>>>>> verify it fixes your problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> paolo: would we still need the madvise or is it enough to not
>>>>>>>>>>>>> write
>>>>>>>>>>>>> the zeroes?
>>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>>>>>> source.
>>>>>>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>>>>>>> increasing amount of reserved memory:
>>>>>>>>>>
>>>>>>>>>> #include <stdio.h>
>>>>>>>>>> #include <stdlib.h>
>>>>>>>>>> int main()
>>>>>>>>>> {
>>>>>>>>>>          char *x = malloc(500 << 20);
>>>>>>>>>>          int i, j;
>>>>>>>>>>          for (i = 0; i < 500; i += 10) {
>>>>>>>>>>              for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>>>>>                   *(volatile char*) (x + (i << 20) + j);
>>>>>>>>>>              }
>>>>>>>>>>              getchar();
>>>>>>>>>>          }
>>>>>>>>>> }
>>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>>>>>
>>>>>>>>> is the malloc above using mmapped memory?
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>> which kernel version do you use?
>>>>>>>> 3.9.
>>>>>>>>
>>>>>>>>> what avoids allocating the memory for me is the following (with
>>>>>>>>> whatever side effects it has ;-))
>>>>>>>> This would also fail to migrate any page that is swapped out, breaking
>>>>>>>> overcommit in a more subtle way. :)
>>>>>>>>
>>>>>>>> Paolo
>>>>>>> the following does also not allocate memory, but qemu does...
>>>>>>>
>>>>>> Hi, Peter
>>>>>>      As the patch writes
>>>>>>
>>>>>> "not sending zero pages breaks migration if a page is zero
>>>>>> at the source but not at the destination."
>>>>>>
>>>>>>      I don't understand why it would be trouble, shouldn't all page
>>>>>> not received in dest be treated as zero pages?
>>>>>
>>>>>
>>>>> How would the destination guest know if some page must be cleared? The
>>>>> previous patch (which Peter reverted) did not send anything for the pages
>>>>> which were zero on the source side.
>>>>>
>>>>>
>>>>     If an page was not received and destination knows that page should
>>>> exist according to total size, fill it with zero at destination, would
>>>> it solve the problem?
>>>
>>> It is _live_ migration, the source sends changes, same pages can change and
>>> be sent several times. So we would need to turn tracking on on the
>>> destination to know if some page was received from the source or changed by
>>> the destination itself (by writing there bios/firmware images, etc) and
>>> then clear pages which were touched by the destination and were not sent by
>>> the source.
>>    OK, I can understand the problem is, for example:
>> Destination boots up with 0x0000-0xFFFF filled with bios image.
>> Source forgot to send zero pages in 0x0000-0xFFFF.
>
>
> The source did not forget, instead it zeroed these pages during its life
> and thought that they must be zeroed at the destination already (as the
> destination did not start and did not have a chance to write something there).
>
>
>> After migration destination got 0x0000-0xFFFF dirty(different with
>> source)
>
> Yep. And those pages were empty on the source what made debugging very easy :)
>
>
>>    Thanks for explain.
>>
>>    This seems refer to the migration protocol: how should the guest treat
>> unsent pages. The patch causing the problem, actually treat zero pages
>> as "not to sent" at source, but another half is missing: treat "not
>> received" as zero pages at destination. I guess if second half is added,
>> problem is gone:
>> after page transfer completed, before destination resume,
>> fill zero in "not received" pages.
>
>
>
> Make a working patch, we'll discuss it :) I do not see much acceleration
> coming from there.
>
>
   4k zero page is compressed into header: 8 bytes flag + 1 byte tail +
( 1 + strlen(idstr) when ramblock is a new one), so take 10 bytes
as average, ram:network flow is 4000:10 = 400:1
   Then for a typical 4GB guest, sending the zero pages will take about
10M network flow, indeed not much acceleration. I think current method
is already good enough, unless there are other benefits in not sending
zero pages.

-- 
Best Regards

Wenchao Xia

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-05  6:09                             ` Peter Lieven
@ 2013-06-09  4:12                               ` liu ping fan
  2013-06-09  7:22                                 ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: liu ping fan @ 2013-06-09  4:12 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Alexey Kardashevskiy, Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson

Hi Peter,

Is it that sending zero page mostly service the first iteration, ie
bluk-stage? And for the subsequent iteration, dirty pages are normally
not zero.

Thanks

On Wed, Jun 5, 2013 at 2:09 PM, Peter Lieven <pl@kamp.de> wrote:
>
> Am 05.06.2013 um 05:37 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>
>> On 06/05/2013 05:15 AM, Peter Lieven wrote:
>>>
>>> Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>>
>>>> Il 04/06/2013 16:48, Peter Lieven ha scritto:
>>>>> Still using 3.2, but strange enough the above example is also not
>>>>> increasing RSS size for me.
>>>>>
>>>>> Can you try the following:
>>>>> qemu git master with 1G of memory (hanging in bios with no boot device)
>>>>> and migrate it. Before migration RSS Size os somewhat
>>>>> around 16MB. After migration its RSS size is in the order of 1G.
>>>>
>>>> That may be a kernel bug.  The kernel did not do the copy-on-write trick
>>>> on huge zero pages.  It was fixed last year, maybe 3.2 is not enough.
>>>> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces.
>>>
>>> that's it. thanks for the pointer. the huge zero page was introduced in 3.8.
>>>
>>> paolo, alexey: can you please verify the following works for you:
>>> https://github.com/plieven/qemu/tree/fix-migration
>>
>> These two?
>> 848b796 Tue Jun 4 14:43:04 2013 +0200 Peter Lieven migration: do not
>> overwrite zero pages
>> 2206ac8 Tue Jun 4 14:25:33 2013 +0200 Peter Lieven Revert "migration: do
>> not sent zero pages in bulk stage"
>
> Yes, sorry forgot to mention this.
>
>>
>> That works for me (qemu 1.5, kernel 3.9-rc2).
>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> Thank you,
> Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-09  4:12                               ` liu ping fan
@ 2013-06-09  7:22                                 ` Peter Lieven
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Lieven @ 2013-06-09  7:22 UTC (permalink / raw)
  To: liu ping fan
  Cc: Alexey Kardashevskiy, Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson


Am 09.06.2013 um 06:12 schrieb liu ping fan <qemulist@gmail.com>:

> Hi Peter,
> 
> Is it that sending zero page mostly service the first iteration, ie
> bluk-stage? And for the subsequent iteration, dirty pages are normally
> not zero.
> 

Yes most Zero Pages are sent during bulk stage except for busy windows guests or linux with page sanitization.
In these cases freed Memory is zeroed.

Peter

> Thanks
> 
> On Wed, Jun 5, 2013 at 2:09 PM, Peter Lieven <pl@kamp.de> wrote:
>> 
>> Am 05.06.2013 um 05:37 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>> 
>>> On 06/05/2013 05:15 AM, Peter Lieven wrote:
>>>> 
>>>> Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>>> 
>>>>> Il 04/06/2013 16:48, Peter Lieven ha scritto:
>>>>>> Still using 3.2, but strange enough the above example is also not
>>>>>> increasing RSS size for me.
>>>>>> 
>>>>>> Can you try the following:
>>>>>> qemu git master with 1G of memory (hanging in bios with no boot device)
>>>>>> and migrate it. Before migration RSS Size os somewhat
>>>>>> around 16MB. After migration its RSS size is in the order of 1G.
>>>>> 
>>>>> That may be a kernel bug.  The kernel did not do the copy-on-write trick
>>>>> on huge zero pages.  It was fixed last year, maybe 3.2 is not enough.
>>>>> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces.
>>>> 
>>>> that's it. thanks for the pointer. the huge zero page was introduced in 3.8.
>>>> 
>>>> paolo, alexey: can you please verify the following works for you:
>>>> https://github.com/plieven/qemu/tree/fix-migration
>>> 
>>> These two?
>>> 848b796 Tue Jun 4 14:43:04 2013 +0200 Peter Lieven migration: do not
>>> overwrite zero pages
>>> 2206ac8 Tue Jun 4 14:25:33 2013 +0200 Peter Lieven Revert "migration: do
>>> not sent zero pages in bulk stage"
>> 
>> Yes, sorry forgot to mention this.
>> 
>>> 
>>> That works for me (qemu 1.5, kernel 3.9-rc2).
>>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> 
>> Thank you,
>> Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-09  3:09                                 ` Alexey Kardashevskiy
  2013-06-09  3:31                                   ` Wenchao Xia
@ 2013-06-09  7:27                                   ` Peter Lieven
  2013-06-10  6:39                                     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-06-09  7:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paolo Bonzini, David Gibson, qemu-ppc, Wenchao Xia, qemu-devel


Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:

> On 06/09/2013 01:01 PM, Wenchao Xia wrote:
>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>>> You could also scan the page for nonzero values before
>>>>>>>>>>>>>>> writing it.
>>>>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>>>>>>> out to be a bad idea.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>>>>>> verify it fixes your problem.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> paolo: would we still need the madvise or is it enough to not
>>>>>>>>>>>>> write
>>>>>>>>>>>>> the zeroes?
>>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>>>>>> source.
>>>>>>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>>>>>>> increasing amount of reserved memory:
>>>>>>>>>> 
>>>>>>>>>> #include <stdio.h>
>>>>>>>>>> #include <stdlib.h>
>>>>>>>>>> int main()
>>>>>>>>>> {
>>>>>>>>>>        char *x = malloc(500 << 20);
>>>>>>>>>>        int i, j;
>>>>>>>>>>        for (i = 0; i < 500; i += 10) {
>>>>>>>>>>            for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>>>>>                 *(volatile char*) (x + (i << 20) + j);
>>>>>>>>>>            }
>>>>>>>>>>            getchar();
>>>>>>>>>>        }
>>>>>>>>>> }
>>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>>>>> 
>>>>>>>>> is the malloc above using mmapped memory?
>>>>>>>> Yes.
>>>>>>>> 
>>>>>>>>> which kernel version do you use?
>>>>>>>> 3.9.
>>>>>>>> 
>>>>>>>>> what avoids allocating the memory for me is the following (with
>>>>>>>>> whatever side effects it has ;-))
>>>>>>>> This would also fail to migrate any page that is swapped out, breaking
>>>>>>>> overcommit in a more subtle way. :)
>>>>>>>> 
>>>>>>>> Paolo
>>>>>>> the following does also not allocate memory, but qemu does...
>>>>>> Hi, Peter
>>>>>>    As the patch writes
>>>>>> 
>>>>>> "not sending zero pages breaks migration if a page is zero
>>>>>> at the source but not at the destination."
>>>>>> 
>>>>>>    I don't understand why it would be trouble, shouldn't all page
>>>>>> not received in dest be treated as zero pages?
>>>>> 
>>>>> 
>>>>> How would the destination guest know if some page must be cleared? The
>>>>> previous patch (which Peter reverted) did not send anything for the pages
>>>>> which were zero on the source side.
>>>>   If an page was not received and destination knows that page should
>>>> exist according to total size, fill it with zero at destination, would
>>>> it solve the problem?
>>> 
>>> It is _live_ migration, the source sends changes, same pages can change and
>>> be sent several times. So we would need to turn tracking on on the
>>> destination to know if some page was received from the source or changed by
>>> the destination itself (by writing there bios/firmware images, etc) and
>>> then clear pages which were touched by the destination and were not sent by
>>> the source.
>>  OK, I can understand the problem is, for example:
>> Destination boots up with 0x0000-0xFFFF filled with bios image.
>> Source forgot to send zero pages in 0x0000-0xFFFF.
> 
> 
> The source did not forget, instead it zeroed these pages during its life
> and thought that they must be zeroed at the destination already (as the
> destination did not start and did not have a chance to write something there).
> 
> 
>> After migration destination got 0x0000-0xFFFF dirty(different with
>> source)
> 
> Yep. And those pages were empty on the source what made debugging very easy :)
> 
> 
>>  Thanks for explain.
>> 
>>  This seems refer to the migration protocol: how should the guest treat
>> unsent pages. The patch causing the problem, actually treat zero pages
>> as "not to sent" at source, but another half is missing: treat "not
>> received" as zero pages at destination. I guess if second half is added,
>> problem is gone:
>> after page transfer completed, before destination resume,
>> fill zero in "not received" pages.
> 
> 
> 
> Make a working patch, we'll discuss it :) I do not see much acceleration
> coming from there.

I would also not spent much time with this. I would either look to find an easy way to fix the initialization code to not unneccessarily load data into RAM or i will sent a v2 of my patch following Eric's concerns.

Peter

> 
> 
> -- 
> Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-09  7:27                                   ` Peter Lieven
@ 2013-06-10  6:39                                     ` Alexey Kardashevskiy
  2013-06-10  6:50                                       ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-10  6:39 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Paolo Bonzini, David Gibson, qemu-ppc, Wenchao Xia, qemu-devel

On 06/09/2013 05:27 PM, Peter Lieven wrote:
> 
> Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
> 
>> On 06/09/2013 01:01 PM, Wenchao Xia wrote:
>>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
>>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>>>> You could also scan the page for nonzero
>>>>>>>>>>>>>>>> values before writing it.
>>>>>>>>>>>>>> i had this in mind, but then choosed the other
>>>>>>>>>>>>>> approach.... turned out to be a bad idea.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> alexey: i will prepare a patch later today,
>>>>>>>>>>>>>> could you then please verify it fixes your
>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> paolo: would we still need the madvise or is
>>>>>>>>>>>>>> it enough to not write the zeroes?
>>>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>>>> Problem: checking the pages for zero allocates
>>>>>>>>>>>> them. even at the source.
>>>>>>>>>>> It doesn't look like.  I tried this program and top
>>>>>>>>>>> doesn't show an increasing amount of reserved
>>>>>>>>>>> memory:
>>>>>>>>>>> 
>>>>>>>>>>> #include <stdio.h> #include <stdlib.h> int main() { 
>>>>>>>>>>> char *x = malloc(500 << 20); int i, j; for (i = 0; i
>>>>>>>>>>> < 500; i += 10) { for (j = 0; j < 10 << 20; j +=
>>>>>>>>>>> 4096) { *(volatile char*) (x + (i << 20) + j); } 
>>>>>>>>>>> getchar(); } }
>>>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>>>> None of the three top values change, and only VIRT is
>>>>>>>>> >500 MB.
>>>>>>>>> 
>>>>>>>>>> is the malloc above using mmapped memory?
>>>>>>>>> Yes.
>>>>>>>>> 
>>>>>>>>>> which kernel version do you use?
>>>>>>>>> 3.9.
>>>>>>>>> 
>>>>>>>>>> what avoids allocating the memory for me is the
>>>>>>>>>> following (with whatever side effects it has ;-))
>>>>>>>>> This would also fail to migrate any page that is swapped
>>>>>>>>> out, breaking overcommit in a more subtle way. :)
>>>>>>>>> 
>>>>>>>>> Paolo
>>>>>>>> the following does also not allocate memory, but qemu
>>>>>>>> does...
>>>>>>> Hi, Peter As the patch writes
>>>>>>> 
>>>>>>> "not sending zero pages breaks migration if a page is zero 
>>>>>>> at the source but not at the destination."
>>>>>>> 
>>>>>>> I don't understand why it would be trouble, shouldn't all
>>>>>>> page not received in dest be treated as zero pages?
>>>>>> 
>>>>>> 
>>>>>> How would the destination guest know if some page must be
>>>>>> cleared? The previous patch (which Peter reverted) did not
>>>>>> send anything for the pages which were zero on the source
>>>>>> side.
>>>>> If an page was not received and destination knows that page
>>>>> should exist according to total size, fill it with zero at
>>>>> destination, would it solve the problem?
>>>> 
>>>> It is _live_ migration, the source sends changes, same pages can
>>>> change and be sent several times. So we would need to turn
>>>> tracking on on the destination to know if some page was received
>>>> from the source or changed by the destination itself (by writing
>>>> there bios/firmware images, etc) and then clear pages which were
>>>> touched by the destination and were not sent by the source.
>>> OK, I can understand the problem is, for example: Destination boots
>>> up with 0x0000-0xFFFF filled with bios image. Source forgot to send
>>> zero pages in 0x0000-0xFFFF.
>> 
>> 
>> The source did not forget, instead it zeroed these pages during its
>> life and thought that they must be zeroed at the destination already
>> (as the destination did not start and did not have a chance to write
>> something there).
>> 
>> 
>>> After migration destination got 0x0000-0xFFFF dirty(different with 
>>> source)
>> 
>> Yep. And those pages were empty on the source what made debugging very
>> easy :)
>> 
>> 
>>> Thanks for explain.
>>> 
>>> This seems refer to the migration protocol: how should the guest
>>> treat unsent pages. The patch causing the problem, actually treat
>>> zero pages as "not to sent" at source, but another half is missing:
>>> treat "not received" as zero pages at destination. I guess if second
>>> half is added, problem is gone: after page transfer completed,
>>> before destination resume, fill zero in "not received" pages.
>> 
>> 
>> 
>> Make a working patch, we'll discuss it :) I do not see much
>> acceleration coming from there.
> 

> I would also not spent much time with this. I would either look to find
> an easy way to fix the initialization code to not unneccessarily load
> data into RAM or i will sent a v2 of my patch following Eric's
> concerns.

There is no easy way to implement the flag and keep your original patch as
we have to implement this flag in all architectures which got broken by
your patch and I personally can fix only PPC64-pseries but not the others.

Furthermore your revert + new patches perfectly solve the problem, why
would we want to bother now with this new flag which nobody really needs
right now?

Please, please, revert the original patch or I'll try to do it :)


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-10  6:39                                     ` Alexey Kardashevskiy
@ 2013-06-10  6:50                                       ` Peter Lieven
  2013-06-10  6:55                                         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-06-10  6:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paolo Bonzini, David Gibson, qemu-ppc, Wenchao Xia, qemu-devel

On 10.06.2013 08:39, Alexey Kardashevskiy wrote:
> On 06/09/2013 05:27 PM, Peter Lieven wrote:
>> Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>>
>>> On 06/09/2013 01:01 PM, Wenchao Xia wrote:
>>>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
>>>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>>>>> You could also scan the page for nonzero
>>>>>>>>>>>>>>>>> values before writing it.
>>>>>>>>>>>>>>> i had this in mind, but then choosed the other
>>>>>>>>>>>>>>> approach.... turned out to be a bad idea.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> alexey: i will prepare a patch later today,
>>>>>>>>>>>>>>> could you then please verify it fixes your
>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> paolo: would we still need the madvise or is
>>>>>>>>>>>>>>> it enough to not write the zeroes?
>>>>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>>>>> Problem: checking the pages for zero allocates
>>>>>>>>>>>>> them. even at the source.
>>>>>>>>>>>> It doesn't look like.  I tried this program and top
>>>>>>>>>>>> doesn't show an increasing amount of reserved
>>>>>>>>>>>> memory:
>>>>>>>>>>>>
>>>>>>>>>>>> #include <stdio.h> #include <stdlib.h> int main() {
>>>>>>>>>>>> char *x = malloc(500 << 20); int i, j; for (i = 0; i
>>>>>>>>>>>> < 500; i += 10) { for (j = 0; j < 10 << 20; j +=
>>>>>>>>>>>> 4096) { *(volatile char*) (x + (i << 20) + j); }
>>>>>>>>>>>> getchar(); } }
>>>>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>>>>> None of the three top values change, and only VIRT is
>>>>>>>>>>> 500 MB.
>>>>>>>>>>> is the malloc above using mmapped memory?
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>> which kernel version do you use?
>>>>>>>>>> 3.9.
>>>>>>>>>>
>>>>>>>>>>> what avoids allocating the memory for me is the
>>>>>>>>>>> following (with whatever side effects it has ;-))
>>>>>>>>>> This would also fail to migrate any page that is swapped
>>>>>>>>>> out, breaking overcommit in a more subtle way. :)
>>>>>>>>>>
>>>>>>>>>> Paolo
>>>>>>>>> the following does also not allocate memory, but qemu
>>>>>>>>> does...
>>>>>>>> Hi, Peter As the patch writes
>>>>>>>>
>>>>>>>> "not sending zero pages breaks migration if a page is zero
>>>>>>>> at the source but not at the destination."
>>>>>>>>
>>>>>>>> I don't understand why it would be trouble, shouldn't all
>>>>>>>> page not received in dest be treated as zero pages?
>>>>>>>
>>>>>>> How would the destination guest know if some page must be
>>>>>>> cleared? The previous patch (which Peter reverted) did not
>>>>>>> send anything for the pages which were zero on the source
>>>>>>> side.
>>>>>> If an page was not received and destination knows that page
>>>>>> should exist according to total size, fill it with zero at
>>>>>> destination, would it solve the problem?
>>>>> It is _live_ migration, the source sends changes, same pages can
>>>>> change and be sent several times. So we would need to turn
>>>>> tracking on on the destination to know if some page was received
>>>>> from the source or changed by the destination itself (by writing
>>>>> there bios/firmware images, etc) and then clear pages which were
>>>>> touched by the destination and were not sent by the source.
>>>> OK, I can understand the problem is, for example: Destination boots
>>>> up with 0x0000-0xFFFF filled with bios image. Source forgot to send
>>>> zero pages in 0x0000-0xFFFF.
>>>
>>> The source did not forget, instead it zeroed these pages during its
>>> life and thought that they must be zeroed at the destination already
>>> (as the destination did not start and did not have a chance to write
>>> something there).
>>>
>>>
>>>> After migration destination got 0x0000-0xFFFF dirty(different with
>>>> source)
>>> Yep. And those pages were empty on the source what made debugging very
>>> easy :)
>>>
>>>
>>>> Thanks for explain.
>>>>
>>>> This seems refer to the migration protocol: how should the guest
>>>> treat unsent pages. The patch causing the problem, actually treat
>>>> zero pages as "not to sent" at source, but another half is missing:
>>>> treat "not received" as zero pages at destination. I guess if second
>>>> half is added, problem is gone: after page transfer completed,
>>>> before destination resume, fill zero in "not received" pages.
>>>
>>>
>>> Make a working patch, we'll discuss it :) I do not see much
>>> acceleration coming from there.
>> I would also not spent much time with this. I would either look to find
>> an easy way to fix the initialization code to not unneccessarily load
>> data into RAM or i will sent a v2 of my patch following Eric's
>> concerns.
> There is no easy way to implement the flag and keep your original patch as
> we have to implement this flag in all architectures which got broken by
> your patch and I personally can fix only PPC64-pseries but not the others.
>
> Furthermore your revert + new patches perfectly solve the problem, why
> would we want to bother now with this new flag which nobody really needs
> right now?
>
> Please, please, revert the original patch or I'll try to do it :)
>
>
I tried, but there where concerns by the community. Alternativly I found
the following alternate solution. Please drop the 2 patches and try the
following:

diff --git a/arch_init.c b/arch_init.c
index 5d32ecf..458bf8c 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -799,6 +799,8 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
                  while (total_ram_bytes) {
                      RAMBlock *block;
                      uint8_t len;
+                    void *base;
+                    ram_addr_t offset;

                      len = qemu_get_byte(f);
                      qemu_get_buffer(f, (uint8_t *)id, len);
@@ -822,6 +824,14 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
                          goto done;
                      }

+                    base = memory_region_get_ram_ptr(block->mr);
+                    for (offset = 0; offset < block->length;
+                         offset += TARGET_PAGE_SIZE) {
+                        if (!is_zero_page(base + offset)) {
+                            memset(base + offset, 0x00, TARGET_PAGE_SIZE);
+                        }
+                    }
+
                      total_ram_bytes -= length;
                  }
              }

This is done at setup time so there is no additional cost for zero checking at each compressed page
coming in.

Peter

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-10  6:50                                       ` Peter Lieven
@ 2013-06-10  6:55                                         ` Alexey Kardashevskiy
  2013-06-10  8:44                                           ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-10  6:55 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Paolo Bonzini, David Gibson, qemu-ppc, Wenchao Xia, qemu-devel

On 06/10/2013 04:50 PM, Peter Lieven wrote:
> On 10.06.2013 08:39, Alexey Kardashevskiy wrote:
>> On 06/09/2013 05:27 PM, Peter Lieven wrote:
>>> Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>>>
>>>> On 06/09/2013 01:01 PM, Wenchao Xia wrote:
>>>>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
>>>>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>>>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>>>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>>>>>> You could also scan the page for nonzero
>>>>>>>>>>>>>>>>>> values before writing it.
>>>>>>>>>>>>>>>> i had this in mind, but then choosed the other
>>>>>>>>>>>>>>>> approach.... turned out to be a bad idea.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> alexey: i will prepare a patch later today,
>>>>>>>>>>>>>>>> could you then please verify it fixes your
>>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> paolo: would we still need the madvise or is
>>>>>>>>>>>>>>>> it enough to not write the zeroes?
>>>>>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>>>>>> Problem: checking the pages for zero allocates
>>>>>>>>>>>>>> them. even at the source.
>>>>>>>>>>>>> It doesn't look like.  I tried this program and top
>>>>>>>>>>>>> doesn't show an increasing amount of reserved
>>>>>>>>>>>>> memory:
>>>>>>>>>>>>>
>>>>>>>>>>>>> #include <stdio.h> #include <stdlib.h> int main() {
>>>>>>>>>>>>> char *x = malloc(500 << 20); int i, j; for (i = 0; i
>>>>>>>>>>>>> < 500; i += 10) { for (j = 0; j < 10 << 20; j +=
>>>>>>>>>>>>> 4096) { *(volatile char*) (x + (i << 20) + j); }
>>>>>>>>>>>>> getchar(); } }
>>>>>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>>>>>> None of the three top values change, and only VIRT is
>>>>>>>>>>>> 500 MB.
>>>>>>>>>>>> is the malloc above using mmapped memory?
>>>>>>>>>>> Yes.
>>>>>>>>>>>
>>>>>>>>>>>> which kernel version do you use?
>>>>>>>>>>> 3.9.
>>>>>>>>>>>
>>>>>>>>>>>> what avoids allocating the memory for me is the
>>>>>>>>>>>> following (with whatever side effects it has ;-))
>>>>>>>>>>> This would also fail to migrate any page that is swapped
>>>>>>>>>>> out, breaking overcommit in a more subtle way. :)
>>>>>>>>>>>
>>>>>>>>>>> Paolo
>>>>>>>>>> the following does also not allocate memory, but qemu
>>>>>>>>>> does...
>>>>>>>>> Hi, Peter As the patch writes
>>>>>>>>>
>>>>>>>>> "not sending zero pages breaks migration if a page is zero
>>>>>>>>> at the source but not at the destination."
>>>>>>>>>
>>>>>>>>> I don't understand why it would be trouble, shouldn't all
>>>>>>>>> page not received in dest be treated as zero pages?
>>>>>>>>
>>>>>>>> How would the destination guest know if some page must be
>>>>>>>> cleared? The previous patch (which Peter reverted) did not
>>>>>>>> send anything for the pages which were zero on the source
>>>>>>>> side.
>>>>>>> If an page was not received and destination knows that page
>>>>>>> should exist according to total size, fill it with zero at
>>>>>>> destination, would it solve the problem?
>>>>>> It is _live_ migration, the source sends changes, same pages can
>>>>>> change and be sent several times. So we would need to turn
>>>>>> tracking on on the destination to know if some page was received
>>>>>> from the source or changed by the destination itself (by writing
>>>>>> there bios/firmware images, etc) and then clear pages which were
>>>>>> touched by the destination and were not sent by the source.
>>>>> OK, I can understand the problem is, for example: Destination boots
>>>>> up with 0x0000-0xFFFF filled with bios image. Source forgot to send
>>>>> zero pages in 0x0000-0xFFFF.
>>>>
>>>> The source did not forget, instead it zeroed these pages during its
>>>> life and thought that they must be zeroed at the destination already
>>>> (as the destination did not start and did not have a chance to write
>>>> something there).
>>>>
>>>>
>>>>> After migration destination got 0x0000-0xFFFF dirty(different with
>>>>> source)
>>>> Yep. And those pages were empty on the source what made debugging very
>>>> easy :)
>>>>
>>>>
>>>>> Thanks for explain.
>>>>>
>>>>> This seems refer to the migration protocol: how should the guest
>>>>> treat unsent pages. The patch causing the problem, actually treat
>>>>> zero pages as "not to sent" at source, but another half is missing:
>>>>> treat "not received" as zero pages at destination. I guess if second
>>>>> half is added, problem is gone: after page transfer completed,
>>>>> before destination resume, fill zero in "not received" pages.
>>>>
>>>>
>>>> Make a working patch, we'll discuss it :) I do not see much
>>>> acceleration coming from there.
>>> I would also not spent much time with this. I would either look to find
>>> an easy way to fix the initialization code to not unneccessarily load
>>> data into RAM or i will sent a v2 of my patch following Eric's
>>> concerns.
>> There is no easy way to implement the flag and keep your original patch as
>> we have to implement this flag in all architectures which got broken by
>> your patch and I personally can fix only PPC64-pseries but not the others.
>>
>> Furthermore your revert + new patches perfectly solve the problem, why
>> would we want to bother now with this new flag which nobody really needs
>> right now?
>>
>> Please, please, revert the original patch or I'll try to do it :)
>>
>>
> I tried, but there where concerns by the community. 


Was here anybody who did not want to revert the patch (besides you)?
I did not notice.


> Alternativly I found
> the following alternate solution. Please drop the 2 patches and try the
> following:


How is it going to work if upstream QEMU doesn't send anything about empty
pages at all (this is why I want to revert that patch)?


> 
> diff --git a/arch_init.c b/arch_init.c
> index 5d32ecf..458bf8c 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -799,6 +799,8 @@ static int ram_load(QEMUFile *f, void *opaque, int
> version_id)
>                  while (total_ram_bytes) {
>                      RAMBlock *block;
>                      uint8_t len;
> +                    void *base;
> +                    ram_addr_t offset;
> 
>                      len = qemu_get_byte(f);
>                      qemu_get_buffer(f, (uint8_t *)id, len);
> @@ -822,6 +824,14 @@ static int ram_load(QEMUFile *f, void *opaque, int
> version_id)
>                          goto done;
>                      }
> 
> +                    base = memory_region_get_ram_ptr(block->mr);
> +                    for (offset = 0; offset < block->length;
> +                         offset += TARGET_PAGE_SIZE) {
> +                        if (!is_zero_page(base + offset)) {
> +                            memset(base + offset, 0x00, TARGET_PAGE_SIZE);
> +                        }
> +                    }
> +
>                      total_ram_bytes -= length;
>                  }
>              }
> 
> This is done at setup time so there is no additional cost for zero checking
> at each compressed page
> coming in.
> 
> Peter


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-10  6:55                                         ` Alexey Kardashevskiy
@ 2013-06-10  8:44                                           ` Peter Lieven
  2013-06-10  9:10                                             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-06-10  8:44 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paolo Bonzini, David Gibson, qemu-ppc, Wenchao Xia, qemu-devel

On 10.06.2013 08:55, Alexey Kardashevskiy wrote:
> On 06/10/2013 04:50 PM, Peter Lieven wrote:
>> On 10.06.2013 08:39, Alexey Kardashevskiy wrote:
>>> On 06/09/2013 05:27 PM, Peter Lieven wrote:
>>>> Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>>>>
>>>>> On 06/09/2013 01:01 PM, Wenchao Xia wrote:
>>>>>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
>>>>>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>>>>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>>>>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>>>>>>> You could also scan the page for nonzero
>>>>>>>>>>>>>>>>>>> values before writing it.
>>>>>>>>>>>>>>>>> i had this in mind, but then choosed the other
>>>>>>>>>>>>>>>>> approach.... turned out to be a bad idea.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> alexey: i will prepare a patch later today,
>>>>>>>>>>>>>>>>> could you then please verify it fixes your
>>>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> paolo: would we still need the madvise or is
>>>>>>>>>>>>>>>>> it enough to not write the zeroes?
>>>>>>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>>>>>>> Problem: checking the pages for zero allocates
>>>>>>>>>>>>>>> them. even at the source.
>>>>>>>>>>>>>> It doesn't look like.  I tried this program and top
>>>>>>>>>>>>>> doesn't show an increasing amount of reserved
>>>>>>>>>>>>>> memory:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #include <stdio.h> #include <stdlib.h> int main() {
>>>>>>>>>>>>>> char *x = malloc(500 << 20); int i, j; for (i = 0; i
>>>>>>>>>>>>>> < 500; i += 10) { for (j = 0; j < 10 << 20; j +=
>>>>>>>>>>>>>> 4096) { *(volatile char*) (x + (i << 20) + j); }
>>>>>>>>>>>>>> getchar(); } }
>>>>>>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>>>>>>> None of the three top values change, and only VIRT is
>>>>>>>>>>>>> 500 MB.
>>>>>>>>>>>>> is the malloc above using mmapped memory?
>>>>>>>>>>>> Yes.
>>>>>>>>>>>>
>>>>>>>>>>>>> which kernel version do you use?
>>>>>>>>>>>> 3.9.
>>>>>>>>>>>>
>>>>>>>>>>>>> what avoids allocating the memory for me is the
>>>>>>>>>>>>> following (with whatever side effects it has ;-))
>>>>>>>>>>>> This would also fail to migrate any page that is swapped
>>>>>>>>>>>> out, breaking overcommit in a more subtle way. :)
>>>>>>>>>>>>
>>>>>>>>>>>> Paolo
>>>>>>>>>>> the following does also not allocate memory, but qemu
>>>>>>>>>>> does...
>>>>>>>>>> Hi, Peter As the patch writes
>>>>>>>>>>
>>>>>>>>>> "not sending zero pages breaks migration if a page is zero
>>>>>>>>>> at the source but not at the destination."
>>>>>>>>>>
>>>>>>>>>> I don't understand why it would be trouble, shouldn't all
>>>>>>>>>> page not received in dest be treated as zero pages?
>>>>>>>>> How would the destination guest know if some page must be
>>>>>>>>> cleared? The previous patch (which Peter reverted) did not
>>>>>>>>> send anything for the pages which were zero on the source
>>>>>>>>> side.
>>>>>>>> If an page was not received and destination knows that page
>>>>>>>> should exist according to total size, fill it with zero at
>>>>>>>> destination, would it solve the problem?
>>>>>>> It is _live_ migration, the source sends changes, same pages can
>>>>>>> change and be sent several times. So we would need to turn
>>>>>>> tracking on on the destination to know if some page was received
>>>>>>> from the source or changed by the destination itself (by writing
>>>>>>> there bios/firmware images, etc) and then clear pages which were
>>>>>>> touched by the destination and were not sent by the source.
>>>>>> OK, I can understand the problem is, for example: Destination boots
>>>>>> up with 0x0000-0xFFFF filled with bios image. Source forgot to send
>>>>>> zero pages in 0x0000-0xFFFF.
>>>>> The source did not forget, instead it zeroed these pages during its
>>>>> life and thought that they must be zeroed at the destination already
>>>>> (as the destination did not start and did not have a chance to write
>>>>> something there).
>>>>>
>>>>>
>>>>>> After migration destination got 0x0000-0xFFFF dirty(different with
>>>>>> source)
>>>>> Yep. And those pages were empty on the source what made debugging very
>>>>> easy :)
>>>>>
>>>>>
>>>>>> Thanks for explain.
>>>>>>
>>>>>> This seems refer to the migration protocol: how should the guest
>>>>>> treat unsent pages. The patch causing the problem, actually treat
>>>>>> zero pages as "not to sent" at source, but another half is missing:
>>>>>> treat "not received" as zero pages at destination. I guess if second
>>>>>> half is added, problem is gone: after page transfer completed,
>>>>>> before destination resume, fill zero in "not received" pages.
>>>>>
>>>>> Make a working patch, we'll discuss it :) I do not see much
>>>>> acceleration coming from there.
>>>> I would also not spent much time with this. I would either look to find
>>>> an easy way to fix the initialization code to not unneccessarily load
>>>> data into RAM or i will sent a v2 of my patch following Eric's
>>>> concerns.
>>> There is no easy way to implement the flag and keep your original patch as
>>> we have to implement this flag in all architectures which got broken by
>>> your patch and I personally can fix only PPC64-pseries but not the others.
>>>
>>> Furthermore your revert + new patches perfectly solve the problem, why
>>> would we want to bother now with this new flag which nobody really needs
>>> right now?
>>>
>>> Please, please, revert the original patch or I'll try to do it :)
>>>
>>>
>> I tried, but there where concerns by the community.
>
> Was here anybody who did not want to revert the patch (besides you)?
> I did not notice.
Eric said I should not drop the skipped_pages stuff in the monitor.
>
>
>> Alternativly I found
>> the following alternate solution. Please drop the 2 patches and try the
>> following:
>
> How is it going to work if upstream QEMU doesn't send anything about empty
> pages at all (this is why I want to revert that patch)?
I do not understand your question. The patch below zeroes out the destination
memory if it is not zero (e.g. if there is a BIOS copied to memory already during
machine init).

I would prefer not to completely drop the patch since it saves bandwidth and
resources.

Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] broken incoming migration
  2013-06-10  8:44                                           ` Peter Lieven
@ 2013-06-10  9:10                                             ` Alexey Kardashevskiy
  2013-06-10  9:33                                               ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt
  0 siblings, 1 reply; 49+ messages in thread
From: Alexey Kardashevskiy @ 2013-06-10  9:10 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Paolo Bonzini, David Gibson, qemu-ppc, Wenchao Xia, qemu-devel

On 06/10/2013 06:44 PM, Peter Lieven wrote:
> On 10.06.2013 08:55, Alexey Kardashevskiy wrote:
>> On 06/10/2013 04:50 PM, Peter Lieven wrote:
>>> On 10.06.2013 08:39, Alexey Kardashevskiy wrote:
>>>> On 06/09/2013 05:27 PM, Peter Lieven wrote:
>>>>> Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>>>>>
>>>>>> On 06/09/2013 01:01 PM, Wenchao Xia wrote:
>>>>>>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
>>>>>>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>>>>>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>>>>>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>>>>>>>> You could also scan the page for nonzero
>>>>>>>>>>>>>>>>>>>> values before writing it.
>>>>>>>>>>>>>>>>>> i had this in mind, but then choosed the other
>>>>>>>>>>>>>>>>>> approach.... turned out to be a bad idea.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> alexey: i will prepare a patch later today,
>>>>>>>>>>>>>>>>>> could you then please verify it fixes your
>>>>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> paolo: would we still need the madvise or is
>>>>>>>>>>>>>>>>>> it enough to not write the zeroes?
>>>>>>>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>>>>>>>> Problem: checking the pages for zero allocates
>>>>>>>>>>>>>>>> them. even at the source.
>>>>>>>>>>>>>>> It doesn't look like.  I tried this program and top
>>>>>>>>>>>>>>> doesn't show an increasing amount of reserved
>>>>>>>>>>>>>>> memory:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #include <stdio.h> #include <stdlib.h> int main() {
>>>>>>>>>>>>>>> char *x = malloc(500 << 20); int i, j; for (i = 0; i
>>>>>>>>>>>>>>> < 500; i += 10) { for (j = 0; j < 10 << 20; j +=
>>>>>>>>>>>>>>> 4096) { *(volatile char*) (x + (i << 20) + j); }
>>>>>>>>>>>>>>> getchar(); } }
>>>>>>>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>>>>>>>> None of the three top values change, and only VIRT is
>>>>>>>>>>>>>> 500 MB.
>>>>>>>>>>>>>> is the malloc above using mmapped memory?
>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> which kernel version do you use?
>>>>>>>>>>>>> 3.9.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> what avoids allocating the memory for me is the
>>>>>>>>>>>>>> following (with whatever side effects it has ;-))
>>>>>>>>>>>>> This would also fail to migrate any page that is swapped
>>>>>>>>>>>>> out, breaking overcommit in a more subtle way. :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Paolo
>>>>>>>>>>>> the following does also not allocate memory, but qemu
>>>>>>>>>>>> does...
>>>>>>>>>>> Hi, Peter As the patch writes
>>>>>>>>>>>
>>>>>>>>>>> "not sending zero pages breaks migration if a page is zero
>>>>>>>>>>> at the source but not at the destination."
>>>>>>>>>>>
>>>>>>>>>>> I don't understand why it would be trouble, shouldn't all
>>>>>>>>>>> page not received in dest be treated as zero pages?
>>>>>>>>>> How would the destination guest know if some page must be
>>>>>>>>>> cleared? The previous patch (which Peter reverted) did not
>>>>>>>>>> send anything for the pages which were zero on the source
>>>>>>>>>> side.
>>>>>>>>> If an page was not received and destination knows that page
>>>>>>>>> should exist according to total size, fill it with zero at
>>>>>>>>> destination, would it solve the problem?
>>>>>>>> It is _live_ migration, the source sends changes, same pages can
>>>>>>>> change and be sent several times. So we would need to turn
>>>>>>>> tracking on on the destination to know if some page was received
>>>>>>>> from the source or changed by the destination itself (by writing
>>>>>>>> there bios/firmware images, etc) and then clear pages which were
>>>>>>>> touched by the destination and were not sent by the source.
>>>>>>> OK, I can understand the problem is, for example: Destination boots
>>>>>>> up with 0x0000-0xFFFF filled with bios image. Source forgot to send
>>>>>>> zero pages in 0x0000-0xFFFF.
>>>>>> The source did not forget, instead it zeroed these pages during its
>>>>>> life and thought that they must be zeroed at the destination already
>>>>>> (as the destination did not start and did not have a chance to write
>>>>>> something there).
>>>>>>
>>>>>>
>>>>>>> After migration destination got 0x0000-0xFFFF dirty(different with
>>>>>>> source)
>>>>>> Yep. And those pages were empty on the source what made debugging very
>>>>>> easy :)
>>>>>>
>>>>>>
>>>>>>> Thanks for explain.
>>>>>>>
>>>>>>> This seems refer to the migration protocol: how should the guest
>>>>>>> treat unsent pages. The patch causing the problem, actually treat
>>>>>>> zero pages as "not to sent" at source, but another half is missing:
>>>>>>> treat "not received" as zero pages at destination. I guess if second
>>>>>>> half is added, problem is gone: after page transfer completed,
>>>>>>> before destination resume, fill zero in "not received" pages.
>>>>>>
>>>>>> Make a working patch, we'll discuss it :) I do not see much
>>>>>> acceleration coming from there.
>>>>> I would also not spent much time with this. I would either look to find
>>>>> an easy way to fix the initialization code to not unneccessarily load
>>>>> data into RAM or i will sent a v2 of my patch following Eric's
>>>>> concerns.
>>>> There is no easy way to implement the flag and keep your original patch as
>>>> we have to implement this flag in all architectures which got broken by
>>>> your patch and I personally can fix only PPC64-pseries but not the others.
>>>>
>>>> Furthermore your revert + new patches perfectly solve the problem, why
>>>> would we want to bother now with this new flag which nobody really needs
>>>> right now?
>>>>
>>>> Please, please, revert the original patch or I'll try to do it :)
>>>>
>>>>
>>> I tried, but there where concerns by the community.
>>
>> Was here anybody who did not want to revert the patch (besides you)?
>> I did not notice.
> Eric said I should not drop the skipped_pages stuff in the monitor.
>>
>>
>>> Alternativly I found
>>> the following alternate solution. Please drop the 2 patches and try the
>>> following:
>>
>> How is it going to work if upstream QEMU doesn't send anything about empty
>> pages at all (this is why I want to revert that patch)?
> I do not understand your question. The patch below zeroes out the destination
> memory if it is not zero (e.g. if there is a BIOS copied to memory already
> during
> machine init).
> 
> I would prefer not to completely drop the patch since it saves bandwidth and
> resources.

I would like migration to do what it should do - send pages no matter what,
this is exactly what migration is for. If there any many, many empty pages
(which I doubt to be a very often real life case), they could all merged in
big consecutive chunks and sent at the end of migration.


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc]  broken incoming migration
  2013-06-10  9:10                                             ` Alexey Kardashevskiy
@ 2013-06-10  9:33                                               ` Benjamin Herrenschmidt
  2013-06-10  9:42                                                 ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-10  9:33 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Lieven, qemu-devel, qemu-ppc, Paolo Bonzini, Wenchao Xia,
	David Gibson

On Mon, 2013-06-10 at 19:10 +1000, Alexey Kardashevskiy wrote:
> > I would prefer not to completely drop the patch since it saves bandwidth and
> > resources.
> 
> I would like migration to do what it should do - send pages no matter what,
> this is exactly what migration is for. If there any many, many empty pages
> (which I doubt to be a very often real life case), they could all merged in
> big consecutive chunks and sent at the end of migration.

I tend to agree. The problem of sending empty pages is purely a problem of
compression. If the current mechanism is deemed "not efficient enough" for
in the case of having lots of zero-pages, then by all means invent a better
packet format for more tightly representing them on the wire, but don't
break things by not sending them at all.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc]  broken incoming migration
  2013-06-10  9:33                                               ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt
@ 2013-06-10  9:42                                                 ` Peter Lieven
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Lieven @ 2013-06-10  9:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Paolo Bonzini,
	Wenchao Xia, David Gibson

On 10.06.2013 11:33, Benjamin Herrenschmidt wrote:
> On Mon, 2013-06-10 at 19:10 +1000, Alexey Kardashevskiy wrote:
>>> I would prefer not to completely drop the patch since it saves bandwidth and
>>> resources.
>> I would like migration to do what it should do - send pages no matter what,
>> this is exactly what migration is for. If there any many, many empty pages
>> (which I doubt to be a very often real life case), they could all merged in
>> big consecutive chunks and sent at the end of migration.
> I tend to agree. The problem of sending empty pages is purely a problem of
> compression. If the current mechanism is deemed "not efficient enough" for
> in the case of having lots of zero-pages, then by all means invent a better
> packet format for more tightly representing them on the wire, but don't
> break things by not sending them at all.

Ok, I see the point. I think the paradigm to say that the destination
should "decide" if it needs a page or not is a sound one.

Zero pages are quite often depending on the lifetime and the operating
system used. But a consecutive range of zero pages is only likely
in the bulk stage. I don't know if its reasonable to add a special encoding
for that.

I will sent a v2 of my previous revert patch addressing Erics concerns shortly.

Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc]  broken incoming migration
  2013-06-09  2:53                             ` Benjamin Herrenschmidt
@ 2013-06-12 14:00                               ` Paolo Bonzini
  2013-06-12 14:11                                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 49+ messages in thread
From: Paolo Bonzini @ 2013-06-12 14:00 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, Peter Lieven, qemu-devel, qemu-ppc,
	Wenchao Xia, David Gibson

Il 08/06/2013 22:53, Benjamin Herrenschmidt ha scritto:
> On Sun, 2013-06-09 at 10:16 +0800, Wenchao Xia wrote:
>>    If an page was not received and destination knows that page should
>> exist according to total size, fill it with zero at destination, would
>> it solve the problem?
> 
> The easiest way to do that is to not write to those pages at the
> destination to begin with, when initializing the VM... Is there any way
> to know that a VM is being setup as a migration target or not ?

There is the "incoming" variable in vl.c (currently not a global), but I
suspect Peter's patch could have also broken loadvm.  It could quickly
become a rat hole.

The only bug we have is not a performance bug related to compression;
it's that writing zero pages breaks overcommit.  Let's fix that, and
only that.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc]  broken incoming migration
  2013-06-12 14:00                               ` Paolo Bonzini
@ 2013-06-12 14:11                                 ` Benjamin Herrenschmidt
  2013-06-12 20:10                                   ` Paolo Bonzini
  0 siblings, 1 reply; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2013-06-12 14:11 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alexey Kardashevskiy, Peter Lieven, qemu-devel, qemu-ppc,
	Wenchao Xia, David Gibson

On Wed, 2013-06-12 at 10:00 -0400, Paolo Bonzini wrote:
> The only bug we have is not a performance bug related to compression;
> it's that writing zero pages breaks overcommit.  Let's fix that, and
> only that.

Right, do we have a way to madvise "throw away" these instead ? Or do we
have a way to track that the platform init code did write something
there and only clear *those* pages ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc]  broken incoming migration
  2013-06-12 14:11                                 ` Benjamin Herrenschmidt
@ 2013-06-12 20:10                                   ` Paolo Bonzini
  2013-06-13  2:41                                     ` Wenchao Xia
  0 siblings, 1 reply; 49+ messages in thread
From: Paolo Bonzini @ 2013-06-12 20:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, Peter Lieven, qemu-devel, qemu-ppc,
	Wenchao Xia, David Gibson

Il 12/06/2013 10:11, Benjamin Herrenschmidt ha scritto:
> On Wed, 2013-06-12 at 10:00 -0400, Paolo Bonzini wrote:
>> The only bug we have is not a performance bug related to compression;
>> it's that writing zero pages breaks overcommit.  Let's fix that, and
>> only that.
> 
> Right, do we have a way to madvise "throw away" these instead ?

We already do that, but apparently that madvise is asynchronous.

> Or do we
> have a way to track that the platform init code did write something
> there and only clear *those* pages ?

No need for; since it's copy-on-write, not copy-on-read :) we can just
check for pages that are zero and not rewrite them with zeros.  That's
what Peter's patches do, I'll review them right away.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc]  broken incoming migration
  2013-06-12 20:10                                   ` Paolo Bonzini
@ 2013-06-13  2:41                                     ` Wenchao Xia
  0 siblings, 0 replies; 49+ messages in thread
From: Wenchao Xia @ 2013-06-13  2:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alexey Kardashevskiy, Peter Lieven, qemu-devel, qemu-ppc, David Gibson

于 2013-6-13 4:10, Paolo Bonzini 写道:
> Il 12/06/2013 10:11, Benjamin Herrenschmidt ha scritto:
>> On Wed, 2013-06-12 at 10:00 -0400, Paolo Bonzini wrote:
>>> The only bug we have is not a performance bug related to compression;
>>> it's that writing zero pages breaks overcommit.  Let's fix that, and
>>> only that.
>>
>> Right, do we have a way to madvise "throw away" these instead ?
>
> We already do that, but apparently that madvise is asynchronous.
>
>> Or do we
>> have a way to track that the platform init code did write something
>> there and only clear *those* pages ?
>
> No need for; since it's copy-on-write, not copy-on-read :) we can just
> check for pages that are zero and not rewrite them with zeros.  That's
   I think it is the right way to improve overcommit without breaking
anything.

> what Peter's patches do, I'll review them right away.
>
> Paolo
>



>


-- 
Best Regards

Wenchao Xia

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2013-06-13  2:41 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-30  7:44 [Qemu-devel] broken incoming migration Alexey Kardashevskiy
2013-05-30  7:49 ` Alexey Kardashevskiy
2013-05-30  7:49 ` Paolo Bonzini
2013-05-30  8:18   ` Alexey Kardashevskiy
2013-05-30  9:08     ` Peter Lieven
2013-05-30  9:31       ` Alexey Kardashevskiy
2013-05-30 13:00       ` Paolo Bonzini
2013-05-30 13:38         ` Alexey Kardashevskiy
2013-05-30 14:08           ` Paolo Bonzini
2013-05-30 14:38         ` Peter Lieven
2013-05-30 14:41           ` Paolo Bonzini
2013-06-04 13:52             ` Peter Lieven
2013-06-04 14:14               ` Paolo Bonzini
2013-06-04 14:38                 ` Peter Lieven
2013-06-04 14:40                   ` Paolo Bonzini
2013-06-04 14:48                     ` Peter Lieven
2013-06-04 15:17                       ` Paolo Bonzini
2013-06-04 19:15                         ` Peter Lieven
2013-06-05  3:37                           ` Alexey Kardashevskiy
2013-06-05  6:09                             ` Peter Lieven
2013-06-09  4:12                               ` liu ping fan
2013-06-09  7:22                                 ` Peter Lieven
2013-06-04 15:10                     ` Peter Lieven
2013-06-08  8:27                       ` Wenchao Xia
2013-06-08  8:30                         ` Alexey Kardashevskiy
2013-06-09  2:16                           ` Wenchao Xia
2013-06-09  2:34                             ` Alexey Kardashevskiy
2013-06-09  2:52                               ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt
2013-06-09  3:01                                 ` Alexey Kardashevskiy
2013-06-09  3:01                               ` [Qemu-devel] " Wenchao Xia
2013-06-09  3:09                                 ` Alexey Kardashevskiy
2013-06-09  3:31                                   ` Wenchao Xia
2013-06-09  7:27                                   ` Peter Lieven
2013-06-10  6:39                                     ` Alexey Kardashevskiy
2013-06-10  6:50                                       ` Peter Lieven
2013-06-10  6:55                                         ` Alexey Kardashevskiy
2013-06-10  8:44                                           ` Peter Lieven
2013-06-10  9:10                                             ` Alexey Kardashevskiy
2013-06-10  9:33                                               ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt
2013-06-10  9:42                                                 ` Peter Lieven
2013-06-09  2:53                             ` Benjamin Herrenschmidt
2013-06-12 14:00                               ` Paolo Bonzini
2013-06-12 14:11                                 ` Benjamin Herrenschmidt
2013-06-12 20:10                                   ` Paolo Bonzini
2013-06-13  2:41                                     ` Wenchao Xia
2013-06-03 10:04           ` [Qemu-devel] " Alexey Kardashevskiy
2013-06-04 10:56             ` Peter Lieven
2013-06-08  8:24         ` Wenchao Xia
2013-05-30 10:18 ` Peter Maydell

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.