* [Qemu-devel] broken incoming migration @ 2013-05-30 7:44 Alexey Kardashevskiy 2013-05-30 7:49 ` Alexey Kardashevskiy ` (2 more replies) 0 siblings, 3 replies; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-05-30 7:44 UTC (permalink / raw) To: qemu-devel; +Cc: qemu-ppc, David Gibson Hi! I found the migration broken on pseries platform, specifically, this patch broke it: f1c72795af573b24a7da5eb52375c9aba8a37972 migration: do not sent zero pages in bulk stage The idea is not to send zero pages to the destination guest which is expected to have 100% empty RAM. However on pseries plaftorm the guest always has some stuff in the RAM as a part of initialization (device tree, system firmware and rtas (?)) so it is not completely empty. As the source guest cannot detect this, it skips some pages during migration and we get a broken destination guest. Bug. While the idea is ok in general, I do not see any easy way to fix it as neither QEMUMachine::init nor QEMUMachine::reset callbacks has information about whether we are about to receive a migration or not (-incoming parameter) and we cannot move device-tree and system firmware initialization anywhere else. ram_bulk_stage is static and cannot be disabled from the platform initialization code. So what would the community suggest? Thanks! -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 7:44 [Qemu-devel] broken incoming migration Alexey Kardashevskiy @ 2013-05-30 7:49 ` Alexey Kardashevskiy 2013-05-30 7:49 ` Paolo Bonzini 2013-05-30 10:18 ` Peter Maydell 2 siblings, 0 replies; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-05-30 7:49 UTC (permalink / raw) To: qemu-devel; +Cc: qemu-ppc, Paul Mackerras, David Gibson Forgot some cc:. On 05/30/2013 05:44 PM, Alexey Kardashevskiy wrote: > Hi! > > I found the migration broken on pseries platform, specifically, this patch > broke it: > > f1c72795af573b24a7da5eb52375c9aba8a37972 > migration: do not sent zero pages in bulk stage > > The idea is not to send zero pages to the destination guest which is > expected to have 100% empty RAM. > > However on pseries plaftorm the guest always has some stuff in the RAM as a > part of initialization (device tree, system firmware and rtas (?)) so it is > not completely empty. As the source guest cannot detect this, it skips some > pages during migration and we get a broken destination guest. Bug. > > While the idea is ok in general, I do not see any easy way to fix it as > neither QEMUMachine::init nor QEMUMachine::reset callbacks has information > about whether we are about to receive a migration or not (-incoming > parameter) and we cannot move device-tree and system firmware > initialization anywhere else. > > ram_bulk_stage is static and cannot be disabled from the platform > initialization code. > > So what would the community suggest? Thanks! > > -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 7:44 [Qemu-devel] broken incoming migration Alexey Kardashevskiy 2013-05-30 7:49 ` Alexey Kardashevskiy @ 2013-05-30 7:49 ` Paolo Bonzini 2013-05-30 8:18 ` Alexey Kardashevskiy 2013-05-30 10:18 ` Peter Maydell 2 siblings, 1 reply; 49+ messages in thread From: Paolo Bonzini @ 2013-05-30 7:49 UTC (permalink / raw) To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, David Gibson Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto: > Hi! > > I found the migration broken on pseries platform, specifically, this patch > broke it: > > f1c72795af573b24a7da5eb52375c9aba8a37972 > migration: do not sent zero pages in bulk stage > > The idea is not to send zero pages to the destination guest which is > expected to have 100% empty RAM. > > However on pseries plaftorm the guest always has some stuff in the RAM as a > part of initialization (device tree, system firmware and rtas (?)) so it is > not completely empty. As the source guest cannot detect this, it skips some > pages during migration and we get a broken destination guest. Bug. > > While the idea is ok in general, I do not see any easy way to fix it as > neither QEMUMachine::init nor QEMUMachine::reset callbacks has information > about whether we are about to receive a migration or not (-incoming > parameter) and we cannot move device-tree and system firmware > initialization anywhere else. > > ram_bulk_stage is static and cannot be disabled from the platform > initialization code. > > So what would the community suggest? Revert the patch. :) Paolo ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 7:49 ` Paolo Bonzini @ 2013-05-30 8:18 ` Alexey Kardashevskiy 2013-05-30 9:08 ` Peter Lieven 0 siblings, 1 reply; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-05-30 8:18 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Peter Lieven, qemu-ppc, qemu-devel, David Gibson On 05/30/2013 05:49 PM, Paolo Bonzini wrote: > Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto: >> Hi! >> >> I found the migration broken on pseries platform, specifically, this patch >> broke it: >> >> f1c72795af573b24a7da5eb52375c9aba8a37972 >> migration: do not sent zero pages in bulk stage >> >> The idea is not to send zero pages to the destination guest which is >> expected to have 100% empty RAM. >> >> However on pseries plaftorm the guest always has some stuff in the RAM as a >> part of initialization (device tree, system firmware and rtas (?)) so it is >> not completely empty. As the source guest cannot detect this, it skips some >> pages during migration and we get a broken destination guest. Bug. >> >> While the idea is ok in general, I do not see any easy way to fix it as >> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information >> about whether we are about to receive a migration or not (-incoming >> parameter) and we cannot move device-tree and system firmware >> initialization anywhere else. >> >> ram_bulk_stage is static and cannot be disabled from the platform >> initialization code. >> >> So what would the community suggest? > > Revert the patch. :) I'll wait for 24 hours (forgot to cc: the author) and then post a revert patch :) -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 8:18 ` Alexey Kardashevskiy @ 2013-05-30 9:08 ` Peter Lieven 2013-05-30 9:31 ` Alexey Kardashevskiy 2013-05-30 13:00 ` Paolo Bonzini 0 siblings, 2 replies; 49+ messages in thread From: Peter Lieven @ 2013-05-30 9:08 UTC (permalink / raw) To: Alexey Kardashevskiy; +Cc: Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy: > On 05/30/2013 05:49 PM, Paolo Bonzini wrote: >> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto: >>> Hi! >>> >>> I found the migration broken on pseries platform, specifically, this patch >>> broke it: >>> >>> f1c72795af573b24a7da5eb52375c9aba8a37972 >>> migration: do not sent zero pages in bulk stage >>> >>> The idea is not to send zero pages to the destination guest which is >>> expected to have 100% empty RAM. >>> >>> However on pseries plaftorm the guest always has some stuff in the RAM as a >>> part of initialization (device tree, system firmware and rtas (?)) so it is >>> not completely empty. As the source guest cannot detect this, it skips some >>> pages during migration and we get a broken destination guest. Bug. >>> >>> While the idea is ok in general, I do not see any easy way to fix it as >>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information >>> about whether we are about to receive a migration or not (-incoming >>> parameter) and we cannot move device-tree and system firmware >>> initialization anywhere else. >>> >>> ram_bulk_stage is static and cannot be disabled from the platform >>> initialization code. >>> >>> So what would the community suggest? >> Revert the patch. :) > I'll wait for 24 hours (forgot to cc: the author) and then post a revert > patch :) > > > does this problem only occur on pseries emulation? not sending zero pages is not only a performance benefit it also makes overcomitted memory usable. the madv_dontneed seems to kick in asynchronously and memory is not available immediately. what I do not understand if the a memory region is not empty at destination due to device tree, firmware etc. it shouldn't be empty at the source as well so in theory this should not be a problem. Peter ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 9:08 ` Peter Lieven @ 2013-05-30 9:31 ` Alexey Kardashevskiy 2013-05-30 13:00 ` Paolo Bonzini 1 sibling, 0 replies; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-05-30 9:31 UTC (permalink / raw) To: Peter Lieven; +Cc: Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson On 05/30/2013 07:08 PM, Peter Lieven wrote: > Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy: >> On 05/30/2013 05:49 PM, Paolo Bonzini wrote: >>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto: >>>> Hi! >>>> >>>> I found the migration broken on pseries platform, specifically, this patch >>>> broke it: >>>> >>>> f1c72795af573b24a7da5eb52375c9aba8a37972 >>>> migration: do not sent zero pages in bulk stage >>>> >>>> The idea is not to send zero pages to the destination guest which is >>>> expected to have 100% empty RAM. >>>> >>>> However on pseries plaftorm the guest always has some stuff in the RAM as a >>>> part of initialization (device tree, system firmware and rtas (?)) so it is >>>> not completely empty. As the source guest cannot detect this, it skips some >>>> pages during migration and we get a broken destination guest. Bug. >>>> >>>> While the idea is ok in general, I do not see any easy way to fix it as >>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information >>>> about whether we are about to receive a migration or not (-incoming >>>> parameter) and we cannot move device-tree and system firmware >>>> initialization anywhere else. >>>> >>>> ram_bulk_stage is static and cannot be disabled from the platform >>>> initialization code. >>>> >>>> So what would the community suggest? >>> Revert the patch. :) >> I'll wait for 24 hours (forgot to cc: the author) and then post a revert >> patch :) >> >> >> > does this problem only occur on pseries emulation? No idea, really. > not sending zero pages is not only a performance benefit it also makes > overcomitted memory usable. the madv_dontneed seems to kick in asynchronously > and memory is not available immediately. Ok, I do not mind :) > what I do not understand if the a memory region is not empty at destination > due to device tree, firmware etc. it shouldn't be empty at the source as well so > in theory this should not be a problem. This is how it works - first QEMU allocates RAM and put devicetree+firmware somewhere. Then QEMU starts the guest so the firmware starts, loads the kernel and then the kernel zeroes the whole (most of?) RAM including the area where the firmware used to be. Now we migrate. If the source guest is in the kernel already, then it does not know about the memory area previously occupied by the firmware, it is just an empty page. If the source guest is still in the firmware, then those pages are not empty and they are perfectly migrated. -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 9:08 ` Peter Lieven 2013-05-30 9:31 ` Alexey Kardashevskiy @ 2013-05-30 13:00 ` Paolo Bonzini 2013-05-30 13:38 ` Alexey Kardashevskiy ` (2 more replies) 1 sibling, 3 replies; 49+ messages in thread From: Paolo Bonzini @ 2013-05-30 13:00 UTC (permalink / raw) To: Peter Lieven; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson Il 30/05/2013 11:08, Peter Lieven ha scritto: > Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy: >> On 05/30/2013 05:49 PM, Paolo Bonzini wrote: >>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto: >>>> Hi! >>>> >>>> I found the migration broken on pseries platform, specifically, this patch >>>> broke it: >>>> >>>> f1c72795af573b24a7da5eb52375c9aba8a37972 >>>> migration: do not sent zero pages in bulk stage >>>> >>>> The idea is not to send zero pages to the destination guest which is >>>> expected to have 100% empty RAM. >>>> >>>> However on pseries plaftorm the guest always has some stuff in the RAM as a >>>> part of initialization (device tree, system firmware and rtas (?)) so it is >>>> not completely empty. As the source guest cannot detect this, it skips some >>>> pages during migration and we get a broken destination guest. Bug. >>>> >>>> While the idea is ok in general, I do not see any easy way to fix it as >>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information >>>> about whether we are about to receive a migration or not (-incoming >>>> parameter) and we cannot move device-tree and system firmware >>>> initialization anywhere else. >>>> >>>> ram_bulk_stage is static and cannot be disabled from the platform >>>> initialization code. >>>> >>>> So what would the community suggest? >>> Revert the patch. :) >> I'll wait for 24 hours (forgot to cc: the author) and then post a revert >> patch :) >> > does this problem only occur on pseries emulation? Probably not. On a PC, it would occur if you had 4K of zeros in the source BIOS but not in the destination BIOS. When you reboot, the BIOS image is wrong. > not sending zero pages is not only a performance benefit it also makes > overcomitted memory usable. the madv_dontneed seems to kick in asynchronously > and memory is not available immediately. You could also scan the page for nonzero values before writing it. Paolo ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 13:00 ` Paolo Bonzini @ 2013-05-30 13:38 ` Alexey Kardashevskiy 2013-05-30 14:08 ` Paolo Bonzini 2013-05-30 14:38 ` Peter Lieven 2013-06-08 8:24 ` Wenchao Xia 2 siblings, 1 reply; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-05-30 13:38 UTC (permalink / raw) To: Paolo Bonzini; +Cc: qemu-ppc, Peter Lieven, qemu-devel, David Gibson On 05/30/2013 11:00 PM, Paolo Bonzini wrote: > Il 30/05/2013 11:08, Peter Lieven ha scritto: >> Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy: >>> On 05/30/2013 05:49 PM, Paolo Bonzini wrote: >>>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto: >>>>> Hi! >>>>> >>>>> I found the migration broken on pseries platform, specifically, this patch >>>>> broke it: >>>>> >>>>> f1c72795af573b24a7da5eb52375c9aba8a37972 >>>>> migration: do not sent zero pages in bulk stage >>>>> >>>>> The idea is not to send zero pages to the destination guest which is >>>>> expected to have 100% empty RAM. >>>>> >>>>> However on pseries plaftorm the guest always has some stuff in the RAM as a >>>>> part of initialization (device tree, system firmware and rtas (?)) so it is >>>>> not completely empty. As the source guest cannot detect this, it skips some >>>>> pages during migration and we get a broken destination guest. Bug. >>>>> >>>>> While the idea is ok in general, I do not see any easy way to fix it as >>>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information >>>>> about whether we are about to receive a migration or not (-incoming >>>>> parameter) and we cannot move device-tree and system firmware >>>>> initialization anywhere else. >>>>> >>>>> ram_bulk_stage is static and cannot be disabled from the platform >>>>> initialization code. >>>>> >>>>> So what would the community suggest? >>>> Revert the patch. :) >>> I'll wait for 24 hours (forgot to cc: the author) and then post a revert >>> patch :) >>> >> does this problem only occur on pseries emulation? > > Probably not. On a PC, it would occur if you had 4K of zeros in the > source BIOS but not in the destination BIOS. When you reboot, the BIOS > image is wrong. > >> not sending zero pages is not only a performance benefit it also makes >> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously >> and memory is not available immediately. > > You could also scan the page for nonzero values before writing it. Scan where? On the source? It is there already. On the destination? Won't just trying to read a page kill all effect from MADV_DONTNEED? -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 13:38 ` Alexey Kardashevskiy @ 2013-05-30 14:08 ` Paolo Bonzini 0 siblings, 0 replies; 49+ messages in thread From: Paolo Bonzini @ 2013-05-30 14:08 UTC (permalink / raw) To: Alexey Kardashevskiy; +Cc: qemu-ppc, Peter Lieven, qemu-devel, David Gibson Il 30/05/2013 15:38, Alexey Kardashevskiy ha scritto: >>> does this problem only occur on pseries emulation? >> >> Probably not. On a PC, it would occur if you had 4K of zeros in the >> source BIOS but not in the destination BIOS. When you reboot, the BIOS >> image is wrong. >> >>> not sending zero pages is not only a performance benefit it also makes >>> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously >>> and memory is not available immediately. >> >> You could also scan the page for nonzero values before writing it. > > Scan where? On the source? It is there already. On the destination? Won't > just trying to read a page kill all effect from MADV_DONTNEED? No, zero pages are handled with special copy-on-write magic by the kernel. :) So, reading works but writing doesn't. Paolo ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 13:00 ` Paolo Bonzini 2013-05-30 13:38 ` Alexey Kardashevskiy @ 2013-05-30 14:38 ` Peter Lieven 2013-05-30 14:41 ` Paolo Bonzini 2013-06-03 10:04 ` [Qemu-devel] " Alexey Kardashevskiy 2013-06-08 8:24 ` Wenchao Xia 2 siblings, 2 replies; 49+ messages in thread From: Peter Lieven @ 2013-05-30 14:38 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson Am 30.05.2013 um 15:41 schrieb "Paolo Bonzini" <pbonzini@redhat.com>: > Il 30/05/2013 11:08, Peter Lieven ha scritto: >> Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy: >>> On 05/30/2013 05:49 PM, Paolo Bonzini wrote: >>>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto: >>>>> Hi! >>>>> >>>>> I found the migration broken on pseries platform, specifically, this patch >>>>> broke it: >>>>> >>>>> f1c72795af573b24a7da5eb52375c9aba8a37972 >>>>> migration: do not sent zero pages in bulk stage >>>>> >>>>> The idea is not to send zero pages to the destination guest which is >>>>> expected to have 100% empty RAM. >>>>> >>>>> However on pseries plaftorm the guest always has some stuff in the RAM as a >>>>> part of initialization (device tree, system firmware and rtas (?)) so it is >>>>> not completely empty. As the source guest cannot detect this, it skips some >>>>> pages during migration and we get a broken destination guest. Bug. >>>>> >>>>> While the idea is ok in general, I do not see any easy way to fix it as >>>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information >>>>> about whether we are about to receive a migration or not (-incoming >>>>> parameter) and we cannot move device-tree and system firmware >>>>> initialization anywhere else. >>>>> >>>>> ram_bulk_stage is static and cannot be disabled from the platform >>>>> initialization code. >>>>> >>>>> So what would the community suggest? >>>> Revert the patch. :) >>> I'll wait for 24 hours (forgot to cc: the author) and then post a revert >>> patch :) >> does this problem only occur on pseries emulation? > > Probably not. On a PC, it would occur if you had 4K of zeros in the > source BIOS but not in the destination BIOS. When you reboot, the BIOS > image is wrong. > >> not sending zero pages is not only a performance benefit it also makes >> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously >> and memory is not available immediately. > > You could also scan the page for nonzero values before writing it. i had this in mind, but then choosed the other approach.... turned out to be a bad idea. alexey: i will prepare a patch later today, could you then please verify it fixes your problem. paolo: would we still need the madvise or is it enough to not write the zeroes? Peter > > Paolo ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 14:38 ` Peter Lieven @ 2013-05-30 14:41 ` Paolo Bonzini 2013-06-04 13:52 ` Peter Lieven 2013-06-03 10:04 ` [Qemu-devel] " Alexey Kardashevskiy 1 sibling, 1 reply; 49+ messages in thread From: Paolo Bonzini @ 2013-05-30 14:41 UTC (permalink / raw) To: Peter Lieven; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson Il 30/05/2013 16:38, Peter Lieven ha scritto: >> > You could also scan the page for nonzero values before writing it. > i had this in mind, but then choosed the other approach.... turned out to be a bad idea. > > alexey: i will prepare a patch later today, could you then please verify it fixes your problem. > > paolo: would we still need the madvise or is it enough to not write the zeroes? It should be enough to not write them. Paolo ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 14:41 ` Paolo Bonzini @ 2013-06-04 13:52 ` Peter Lieven 2013-06-04 14:14 ` Paolo Bonzini 0 siblings, 1 reply; 49+ messages in thread From: Peter Lieven @ 2013-06-04 13:52 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson On 30.05.2013 16:41, Paolo Bonzini wrote: > Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>> You could also scan the page for nonzero values before writing it. >> i had this in mind, but then choosed the other approach.... turned out to be a bad idea. >> >> alexey: i will prepare a patch later today, could you then please verify it fixes your problem. >> >> paolo: would we still need the madvise or is it enough to not write the zeroes? > It should be enough to not write them. Problem: checking the pages for zero allocates them. even at the source. Peter ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-04 13:52 ` Peter Lieven @ 2013-06-04 14:14 ` Paolo Bonzini 2013-06-04 14:38 ` Peter Lieven 0 siblings, 1 reply; 49+ messages in thread From: Paolo Bonzini @ 2013-06-04 14:14 UTC (permalink / raw) To: Peter Lieven; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson Il 04/06/2013 15:52, Peter Lieven ha scritto: > On 30.05.2013 16:41, Paolo Bonzini wrote: >> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>> You could also scan the page for nonzero values before writing it. >>> i had this in mind, but then choosed the other approach.... turned >>> out to be a bad idea. >>> >>> alexey: i will prepare a patch later today, could you then please >>> verify it fixes your problem. >>> >>> paolo: would we still need the madvise or is it enough to not write >>> the zeroes? >> It should be enough to not write them. > > Problem: checking the pages for zero allocates them. even at the source. It doesn't look like. I tried this program and top doesn't show an increasing amount of reserved memory: #include <stdio.h> #include <stdlib.h> int main() { char *x = malloc(500 << 20); int i, j; for (i = 0; i < 500; i += 10) { for (j = 0; j < 10 << 20; j += 4096) { *(volatile char*) (x + (i << 20) + j); } getchar(); } } Paolo ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-04 14:14 ` Paolo Bonzini @ 2013-06-04 14:38 ` Peter Lieven 2013-06-04 14:40 ` Paolo Bonzini 0 siblings, 1 reply; 49+ messages in thread From: Peter Lieven @ 2013-06-04 14:38 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson On 04.06.2013 16:14, Paolo Bonzini wrote: > Il 04/06/2013 15:52, Peter Lieven ha scritto: >> On 30.05.2013 16:41, Paolo Bonzini wrote: >>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>> You could also scan the page for nonzero values before writing it. >>>> i had this in mind, but then choosed the other approach.... turned >>>> out to be a bad idea. >>>> >>>> alexey: i will prepare a patch later today, could you then please >>>> verify it fixes your problem. >>>> >>>> paolo: would we still need the madvise or is it enough to not write >>>> the zeroes? >>> It should be enough to not write them. >> Problem: checking the pages for zero allocates them. even at the source. > It doesn't look like. I tried this program and top doesn't show an > increasing amount of reserved memory: > > #include <stdio.h> > #include <stdlib.h> > int main() > { > char *x = malloc(500 << 20); > int i, j; > for (i = 0; i < 500; i += 10) { > for (j = 0; j < 10 << 20; j += 4096) { > *(volatile char*) (x + (i << 20) + j); > } > getchar(); > } > } strange. we are talking about RSS size, right? is the malloc above using mmapped memory? which kernel version do you use? what avoids allocating the memory for me is the following (with whatever side effects it has ;-)) diff --git a/arch_init.c b/arch_init.c index 642f241..25d20a9 100644 --- a/arch_init.c +++ b/arch_init.c @@ -148,6 +148,10 @@ int qemu_read_default_config_files(bool userconfig) static inline bool is_zero_page(uint8_t *p) { + uint8_t ret; + if (mincore(p, TARGET_PAGE_SIZE, &ret) == 0 && !(ret&0x1)) { + return 1; + } return buffer_find_nonzero_offset(p, TARGET_PAGE_SIZE) == TARGET_PAGE_SIZE; } Peter > > Paolo -- Mit freundlichen Grüßen Peter Lieven ........................................................... KAMP Netzwerkdienste GmbH Vestische Str. 89-91 | 46117 Oberhausen Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40 pl@kamp.de | http://www.kamp.de Geschäftsführer: Heiner Lante | Michael Lante Amtsgericht Duisburg | HRB Nr. 12154 USt-Id-Nr.: DE 120607556 ........................................................... ^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-04 14:38 ` Peter Lieven @ 2013-06-04 14:40 ` Paolo Bonzini 2013-06-04 14:48 ` Peter Lieven 2013-06-04 15:10 ` Peter Lieven 0 siblings, 2 replies; 49+ messages in thread From: Paolo Bonzini @ 2013-06-04 14:40 UTC (permalink / raw) To: Peter Lieven; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson Il 04/06/2013 16:38, Peter Lieven ha scritto: > On 04.06.2013 16:14, Paolo Bonzini wrote: >> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>> You could also scan the page for nonzero values before writing it. >>>>> i had this in mind, but then choosed the other approach.... turned >>>>> out to be a bad idea. >>>>> >>>>> alexey: i will prepare a patch later today, could you then please >>>>> verify it fixes your problem. >>>>> >>>>> paolo: would we still need the madvise or is it enough to not write >>>>> the zeroes? >>>> It should be enough to not write them. >>> Problem: checking the pages for zero allocates them. even at the source. >> It doesn't look like. I tried this program and top doesn't show an >> increasing amount of reserved memory: >> >> #include <stdio.h> >> #include <stdlib.h> >> int main() >> { >> char *x = malloc(500 << 20); >> int i, j; >> for (i = 0; i < 500; i += 10) { >> for (j = 0; j < 10 << 20; j += 4096) { >> *(volatile char*) (x + (i << 20) + j); >> } >> getchar(); >> } >> } > strange. we are talking about RSS size, right? None of the three top values change, and only VIRT is >500 MB. > is the malloc above using mmapped memory? Yes. > which kernel version do you use? 3.9. > what avoids allocating the memory for me is the following (with > whatever side effects it has ;-)) This would also fail to migrate any page that is swapped out, breaking overcommit in a more subtle way. :) Paolo ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-04 14:40 ` Paolo Bonzini @ 2013-06-04 14:48 ` Peter Lieven 2013-06-04 15:17 ` Paolo Bonzini 2013-06-04 15:10 ` Peter Lieven 1 sibling, 1 reply; 49+ messages in thread From: Peter Lieven @ 2013-06-04 14:48 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson On 04.06.2013 16:40, Paolo Bonzini wrote: > Il 04/06/2013 16:38, Peter Lieven ha scritto: >> On 04.06.2013 16:14, Paolo Bonzini wrote: >>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>> You could also scan the page for nonzero values before writing it. >>>>>> i had this in mind, but then choosed the other approach.... turned >>>>>> out to be a bad idea. >>>>>> >>>>>> alexey: i will prepare a patch later today, could you then please >>>>>> verify it fixes your problem. >>>>>> >>>>>> paolo: would we still need the madvise or is it enough to not write >>>>>> the zeroes? >>>>> It should be enough to not write them. >>>> Problem: checking the pages for zero allocates them. even at the source. >>> It doesn't look like. I tried this program and top doesn't show an >>> increasing amount of reserved memory: >>> >>> #include <stdio.h> >>> #include <stdlib.h> >>> int main() >>> { >>> char *x = malloc(500 << 20); >>> int i, j; >>> for (i = 0; i < 500; i += 10) { >>> for (j = 0; j < 10 << 20; j += 4096) { >>> *(volatile char*) (x + (i << 20) + j); >>> } >>> getchar(); >>> } >>> } >> strange. we are talking about RSS size, right? > None of the three top values change, and only VIRT is >500 MB. > >> is the malloc above using mmapped memory? > Yes. > >> which kernel version do you use? > 3.9. Still using 3.2, but strange enough the above example is also not increasing RSS size for me. Can you try the following: qemu git master with 1G of memory (hanging in bios with no boot device) and migrate it. Before migration RSS Size os somewhat around 16MB. After migration its RSS size is in the order of 1G. Peter ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-04 14:48 ` Peter Lieven @ 2013-06-04 15:17 ` Paolo Bonzini 2013-06-04 19:15 ` Peter Lieven 0 siblings, 1 reply; 49+ messages in thread From: Paolo Bonzini @ 2013-06-04 15:17 UTC (permalink / raw) To: Peter Lieven; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson Il 04/06/2013 16:48, Peter Lieven ha scritto: > Still using 3.2, but strange enough the above example is also not > increasing RSS size for me. > > Can you try the following: > qemu git master with 1G of memory (hanging in bios with no boot device) > and migrate it. Before migration RSS Size os somewhat > around 16MB. After migration its RSS size is in the order of 1G. That may be a kernel bug. The kernel did not do the copy-on-write trick on huge zero pages. It was fixed last year, maybe 3.2 is not enough. Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces. Paolo ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-04 15:17 ` Paolo Bonzini @ 2013-06-04 19:15 ` Peter Lieven 2013-06-05 3:37 ` Alexey Kardashevskiy 0 siblings, 1 reply; 49+ messages in thread From: Peter Lieven @ 2013-06-04 19:15 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>: > Il 04/06/2013 16:48, Peter Lieven ha scritto: >> Still using 3.2, but strange enough the above example is also not >> increasing RSS size for me. >> >> Can you try the following: >> qemu git master with 1G of memory (hanging in bios with no boot device) >> and migrate it. Before migration RSS Size os somewhat >> around 16MB. After migration its RSS size is in the order of 1G. > > That may be a kernel bug. The kernel did not do the copy-on-write trick > on huge zero pages. It was fixed last year, maybe 3.2 is not enough. > Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces. that's it. thanks for the pointer. the huge zero page was introduced in 3.8. paolo, alexey: can you please verify the following works for you: https://github.com/plieven/qemu/tree/fix-migration thanks peter > > Paolo > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-04 19:15 ` Peter Lieven @ 2013-06-05 3:37 ` Alexey Kardashevskiy 2013-06-05 6:09 ` Peter Lieven 0 siblings, 1 reply; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-06-05 3:37 UTC (permalink / raw) To: Peter Lieven; +Cc: Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson On 06/05/2013 05:15 AM, Peter Lieven wrote: > > Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>: > >> Il 04/06/2013 16:48, Peter Lieven ha scritto: >>> Still using 3.2, but strange enough the above example is also not >>> increasing RSS size for me. >>> >>> Can you try the following: >>> qemu git master with 1G of memory (hanging in bios with no boot device) >>> and migrate it. Before migration RSS Size os somewhat >>> around 16MB. After migration its RSS size is in the order of 1G. >> >> That may be a kernel bug. The kernel did not do the copy-on-write trick >> on huge zero pages. It was fixed last year, maybe 3.2 is not enough. >> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces. > > that's it. thanks for the pointer. the huge zero page was introduced in 3.8. > > paolo, alexey: can you please verify the following works for you: > https://github.com/plieven/qemu/tree/fix-migration These two? 848b796 Tue Jun 4 14:43:04 2013 +0200 Peter Lieven migration: do not overwrite zero pages 2206ac8 Tue Jun 4 14:25:33 2013 +0200 Peter Lieven Revert "migration: do not sent zero pages in bulk stage" That works for me (qemu 1.5, kernel 3.9-rc2). Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-05 3:37 ` Alexey Kardashevskiy @ 2013-06-05 6:09 ` Peter Lieven 2013-06-09 4:12 ` liu ping fan 0 siblings, 1 reply; 49+ messages in thread From: Peter Lieven @ 2013-06-05 6:09 UTC (permalink / raw) To: Alexey Kardashevskiy; +Cc: Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson Am 05.06.2013 um 05:37 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>: > On 06/05/2013 05:15 AM, Peter Lieven wrote: >> >> Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>: >> >>> Il 04/06/2013 16:48, Peter Lieven ha scritto: >>>> Still using 3.2, but strange enough the above example is also not >>>> increasing RSS size for me. >>>> >>>> Can you try the following: >>>> qemu git master with 1G of memory (hanging in bios with no boot device) >>>> and migrate it. Before migration RSS Size os somewhat >>>> around 16MB. After migration its RSS size is in the order of 1G. >>> >>> That may be a kernel bug. The kernel did not do the copy-on-write trick >>> on huge zero pages. It was fixed last year, maybe 3.2 is not enough. >>> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces. >> >> that's it. thanks for the pointer. the huge zero page was introduced in 3.8. >> >> paolo, alexey: can you please verify the following works for you: >> https://github.com/plieven/qemu/tree/fix-migration > > These two? > 848b796 Tue Jun 4 14:43:04 2013 +0200 Peter Lieven migration: do not > overwrite zero pages > 2206ac8 Tue Jun 4 14:25:33 2013 +0200 Peter Lieven Revert "migration: do > not sent zero pages in bulk stage" Yes, sorry forgot to mention this. > > That works for me (qemu 1.5, kernel 3.9-rc2). > Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Thank you, Peter ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-05 6:09 ` Peter Lieven @ 2013-06-09 4:12 ` liu ping fan 2013-06-09 7:22 ` Peter Lieven 0 siblings, 1 reply; 49+ messages in thread From: liu ping fan @ 2013-06-09 4:12 UTC (permalink / raw) To: Peter Lieven Cc: Alexey Kardashevskiy, Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson Hi Peter, Is it that sending zero page mostly service the first iteration, ie bluk-stage? And for the subsequent iteration, dirty pages are normally not zero. Thanks On Wed, Jun 5, 2013 at 2:09 PM, Peter Lieven <pl@kamp.de> wrote: > > Am 05.06.2013 um 05:37 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>: > >> On 06/05/2013 05:15 AM, Peter Lieven wrote: >>> >>> Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>: >>> >>>> Il 04/06/2013 16:48, Peter Lieven ha scritto: >>>>> Still using 3.2, but strange enough the above example is also not >>>>> increasing RSS size for me. >>>>> >>>>> Can you try the following: >>>>> qemu git master with 1G of memory (hanging in bios with no boot device) >>>>> and migrate it. Before migration RSS Size os somewhat >>>>> around 16MB. After migration its RSS size is in the order of 1G. >>>> >>>> That may be a kernel bug. The kernel did not do the copy-on-write trick >>>> on huge zero pages. It was fixed last year, maybe 3.2 is not enough. >>>> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces. >>> >>> that's it. thanks for the pointer. the huge zero page was introduced in 3.8. >>> >>> paolo, alexey: can you please verify the following works for you: >>> https://github.com/plieven/qemu/tree/fix-migration >> >> These two? >> 848b796 Tue Jun 4 14:43:04 2013 +0200 Peter Lieven migration: do not >> overwrite zero pages >> 2206ac8 Tue Jun 4 14:25:33 2013 +0200 Peter Lieven Revert "migration: do >> not sent zero pages in bulk stage" > > Yes, sorry forgot to mention this. > >> >> That works for me (qemu 1.5, kernel 3.9-rc2). >> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > Thank you, > Peter ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-09 4:12 ` liu ping fan @ 2013-06-09 7:22 ` Peter Lieven 0 siblings, 0 replies; 49+ messages in thread From: Peter Lieven @ 2013-06-09 7:22 UTC (permalink / raw) To: liu ping fan Cc: Alexey Kardashevskiy, Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson Am 09.06.2013 um 06:12 schrieb liu ping fan <qemulist@gmail.com>: > Hi Peter, > > Is it that sending zero page mostly service the first iteration, ie > bluk-stage? And for the subsequent iteration, dirty pages are normally > not zero. > Yes most Zero Pages are sent during bulk stage except for busy windows guests or linux with page sanitization. In these cases freed Memory is zeroed. Peter > Thanks > > On Wed, Jun 5, 2013 at 2:09 PM, Peter Lieven <pl@kamp.de> wrote: >> >> Am 05.06.2013 um 05:37 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>: >> >>> On 06/05/2013 05:15 AM, Peter Lieven wrote: >>>> >>>> Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>: >>>> >>>>> Il 04/06/2013 16:48, Peter Lieven ha scritto: >>>>>> Still using 3.2, but strange enough the above example is also not >>>>>> increasing RSS size for me. >>>>>> >>>>>> Can you try the following: >>>>>> qemu git master with 1G of memory (hanging in bios with no boot device) >>>>>> and migrate it. Before migration RSS Size os somewhat >>>>>> around 16MB. After migration its RSS size is in the order of 1G. >>>>> >>>>> That may be a kernel bug. The kernel did not do the copy-on-write trick >>>>> on huge zero pages. It was fixed last year, maybe 3.2 is not enough. >>>>> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces. >>>> >>>> that's it. thanks for the pointer. the huge zero page was introduced in 3.8. >>>> >>>> paolo, alexey: can you please verify the following works for you: >>>> https://github.com/plieven/qemu/tree/fix-migration >>> >>> These two? >>> 848b796 Tue Jun 4 14:43:04 2013 +0200 Peter Lieven migration: do not >>> overwrite zero pages >>> 2206ac8 Tue Jun 4 14:25:33 2013 +0200 Peter Lieven Revert "migration: do >>> not sent zero pages in bulk stage" >> >> Yes, sorry forgot to mention this. >> >>> >>> That works for me (qemu 1.5, kernel 3.9-rc2). >>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> >> >> Thank you, >> Peter ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-04 14:40 ` Paolo Bonzini 2013-06-04 14:48 ` Peter Lieven @ 2013-06-04 15:10 ` Peter Lieven 2013-06-08 8:27 ` Wenchao Xia 1 sibling, 1 reply; 49+ messages in thread From: Peter Lieven @ 2013-06-04 15:10 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson On 04.06.2013 16:40, Paolo Bonzini wrote: > Il 04/06/2013 16:38, Peter Lieven ha scritto: >> On 04.06.2013 16:14, Paolo Bonzini wrote: >>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>> You could also scan the page for nonzero values before writing it. >>>>>> i had this in mind, but then choosed the other approach.... turned >>>>>> out to be a bad idea. >>>>>> >>>>>> alexey: i will prepare a patch later today, could you then please >>>>>> verify it fixes your problem. >>>>>> >>>>>> paolo: would we still need the madvise or is it enough to not write >>>>>> the zeroes? >>>>> It should be enough to not write them. >>>> Problem: checking the pages for zero allocates them. even at the source. >>> It doesn't look like. I tried this program and top doesn't show an >>> increasing amount of reserved memory: >>> >>> #include <stdio.h> >>> #include <stdlib.h> >>> int main() >>> { >>> char *x = malloc(500 << 20); >>> int i, j; >>> for (i = 0; i < 500; i += 10) { >>> for (j = 0; j < 10 << 20; j += 4096) { >>> *(volatile char*) (x + (i << 20) + j); >>> } >>> getchar(); >>> } >>> } >> strange. we are talking about RSS size, right? > None of the three top values change, and only VIRT is >500 MB. > >> is the malloc above using mmapped memory? > Yes. > >> which kernel version do you use? > 3.9. > >> what avoids allocating the memory for me is the following (with >> whatever side effects it has ;-)) > This would also fail to migrate any page that is swapped out, breaking > overcommit in a more subtle way. :) > > Paolo the following does also not allocate memory, but qemu does... #include <stdio.h> #include <stdlib.h> #include <assert.h> #include <unistd.h> #include <sys/resource.h> #include <inttypes.h> #include <string.h> #include <sys/mman.h> #include <errno.h> #if defined __SSE2__ #include <emmintrin.h> #define VECTYPE __m128i #define SPLAT(p) _mm_set1_epi8(*(p)) #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF) #else #define VECTYPE unsigned long #define SPLAT(p) (*(p) * (~0UL / 255)) #define ALL_EQ(v1, v2) ((v1) == (v2)) #endif #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8 /* Round number down to multiple */ #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m)) /* Round number up to multiple */ #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m)) #define QEMU_VMALLOC_ALIGN (256 * 4096) /* alloc shared memory pages */ void *qemu_anon_ram_alloc(size_t size) { size_t align = QEMU_VMALLOC_ALIGN; size_t total = size + align - getpagesize(); void *ptr = mmap(0, total, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr; if (ptr == MAP_FAILED) { fprintf(stderr, "Failed to allocate %zu B: %s\n", size, strerror(errno)); abort(); } ptr += offset; total -= offset; if (offset > 0) { munmap(ptr - offset, offset); } if (total > size) { munmap(ptr + size, total - size); } return ptr; } static inline int can_use_buffer_find_nonzero_offset(const void *buf, size_t len) { return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)) == 0 && ((uintptr_t) buf) % sizeof(VECTYPE) == 0); } size_t buffer_find_nonzero_offset(const void *buf, size_t len) { const VECTYPE *p = buf; const VECTYPE zero = (VECTYPE){0}; size_t i; if (!len) { return 0; } assert(can_use_buffer_find_nonzero_offset(buf, len)); for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) { if (!ALL_EQ(p[i], zero)) { return i * sizeof(VECTYPE); } } for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i < len / sizeof(VECTYPE); i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) { VECTYPE tmp0 = p[i + 0] | p[i + 1]; VECTYPE tmp1 = p[i + 2] | p[i + 3]; VECTYPE tmp2 = p[i + 4] | p[i + 5]; VECTYPE tmp3 = p[i + 6] | p[i + 7]; VECTYPE tmp01 = tmp0 | tmp1; VECTYPE tmp23 = tmp2 | tmp3; if (!ALL_EQ(tmp01 | tmp23, zero)) { break; } } return i * sizeof(VECTYPE); } int main() { //char *x = malloc(1024 << 20); char *x = qemu_anon_ram_alloc(1024 << 20); int i, j; int ret = 0; struct rusage rusage; for (i = 0; i < 500; i ++) { for (j = 0; j < 10 << 20; j += 4096) { ret += buffer_find_nonzero_offset((char*) (x + (i << 20) + j), 4096); } getrusage( RUSAGE_SELF, &rusage ); printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10), rusage.ru_maxrss); getchar(); } printf("%d zero pages\n", ret); } ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-04 15:10 ` Peter Lieven @ 2013-06-08 8:27 ` Wenchao Xia 2013-06-08 8:30 ` Alexey Kardashevskiy 0 siblings, 1 reply; 49+ messages in thread From: Wenchao Xia @ 2013-06-08 8:27 UTC (permalink / raw) To: Peter Lieven Cc: Alexey Kardashevskiy, Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson > On 04.06.2013 16:40, Paolo Bonzini wrote: >> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>> You could also scan the page for nonzero values before writing it. >>>>>>> i had this in mind, but then choosed the other approach.... turned >>>>>>> out to be a bad idea. >>>>>>> >>>>>>> alexey: i will prepare a patch later today, could you then please >>>>>>> verify it fixes your problem. >>>>>>> >>>>>>> paolo: would we still need the madvise or is it enough to not write >>>>>>> the zeroes? >>>>>> It should be enough to not write them. >>>>> Problem: checking the pages for zero allocates them. even at the >>>>> source. >>>> It doesn't look like. I tried this program and top doesn't show an >>>> increasing amount of reserved memory: >>>> >>>> #include <stdio.h> >>>> #include <stdlib.h> >>>> int main() >>>> { >>>> char *x = malloc(500 << 20); >>>> int i, j; >>>> for (i = 0; i < 500; i += 10) { >>>> for (j = 0; j < 10 << 20; j += 4096) { >>>> *(volatile char*) (x + (i << 20) + j); >>>> } >>>> getchar(); >>>> } >>>> } >>> strange. we are talking about RSS size, right? >> None of the three top values change, and only VIRT is >500 MB. >> >>> is the malloc above using mmapped memory? >> Yes. >> >>> which kernel version do you use? >> 3.9. >> >>> what avoids allocating the memory for me is the following (with >>> whatever side effects it has ;-)) >> This would also fail to migrate any page that is swapped out, breaking >> overcommit in a more subtle way. :) >> >> Paolo > the following does also not allocate memory, but qemu does... > Hi, Peter As the patch writes "not sending zero pages breaks migration if a page is zero at the source but not at the destination." I don't understand why it would be trouble, shouldn't all page not received in dest be treated as zero pages? Also, you mean following code is from qemu and it does not allocate memory with you gcc right? Maybe it is related to KVM, how about turn off KVM and retry following code in qemu? > #include <stdio.h> > #include <stdlib.h> > #include <assert.h> > #include <unistd.h> > #include <sys/resource.h> > #include <inttypes.h> > #include <string.h> > #include <sys/mman.h> > #include <errno.h> > > #if defined __SSE2__ > #include <emmintrin.h> > #define VECTYPE __m128i > #define SPLAT(p) _mm_set1_epi8(*(p)) > #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == > 0xFFFF) > #else > #define VECTYPE unsigned long > #define SPLAT(p) (*(p) * (~0UL / 255)) > #define ALL_EQ(v1, v2) ((v1) == (v2)) > #endif > > #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8 > > /* Round number down to multiple */ > #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m)) > > /* Round number up to multiple */ > #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m)) > > #define QEMU_VMALLOC_ALIGN (256 * 4096) > > /* alloc shared memory pages */ > void *qemu_anon_ram_alloc(size_t size) > { > size_t align = QEMU_VMALLOC_ALIGN; > size_t total = size + align - getpagesize(); > void *ptr = mmap(0, total, PROT_READ | PROT_WRITE, > MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); > size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr; > > if (ptr == MAP_FAILED) { > fprintf(stderr, "Failed to allocate %zu B: %s\n", > size, strerror(errno)); > abort(); > } > > ptr += offset; > total -= offset; > > if (offset > 0) { > munmap(ptr - offset, offset); > } > if (total > size) { > munmap(ptr + size, total - size); > } > > return ptr; > } > > static inline int > can_use_buffer_find_nonzero_offset(const void *buf, size_t len) > { > return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR > * sizeof(VECTYPE)) == 0 > && ((uintptr_t) buf) % sizeof(VECTYPE) == 0); > } > > size_t buffer_find_nonzero_offset(const void *buf, size_t len) > { > const VECTYPE *p = buf; > const VECTYPE zero = (VECTYPE){0}; > size_t i; > > if (!len) { > return 0; > } > > assert(can_use_buffer_find_nonzero_offset(buf, len)); > > for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) { > if (!ALL_EQ(p[i], zero)) { > return i * sizeof(VECTYPE); > } > } > > for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; > i < len / sizeof(VECTYPE); > i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) { > VECTYPE tmp0 = p[i + 0] | p[i + 1]; > VECTYPE tmp1 = p[i + 2] | p[i + 3]; > VECTYPE tmp2 = p[i + 4] | p[i + 5]; > VECTYPE tmp3 = p[i + 6] | p[i + 7]; > VECTYPE tmp01 = tmp0 | tmp1; > VECTYPE tmp23 = tmp2 | tmp3; > if (!ALL_EQ(tmp01 | tmp23, zero)) { > break; > } > } > > return i * sizeof(VECTYPE); > } > > int main() > { > //char *x = malloc(1024 << 20); > char *x = qemu_anon_ram_alloc(1024 << 20); > > int i, j; > int ret = 0; > struct rusage rusage; > for (i = 0; i < 500; i ++) { > for (j = 0; j < 10 << 20; j += 4096) { > ret += buffer_find_nonzero_offset((char*) (x + (i << 20) > + j), 4096); > } > getrusage( RUSAGE_SELF, &rusage ); > printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10), > rusage.ru_maxrss); > getchar(); > } > printf("%d zero pages\n", ret); > } > -- Best Regards Wenchao Xia ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-08 8:27 ` Wenchao Xia @ 2013-06-08 8:30 ` Alexey Kardashevskiy 2013-06-09 2:16 ` Wenchao Xia 0 siblings, 1 reply; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-06-08 8:30 UTC (permalink / raw) To: Wenchao Xia Cc: qemu-ppc, Paolo Bonzini, Peter Lieven, qemu-devel, David Gibson On 06/08/2013 06:27 PM, Wenchao Xia wrote: >> On 04.06.2013 16:40, Paolo Bonzini wrote: >>> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>> You could also scan the page for nonzero values before writing it. >>>>>>>> i had this in mind, but then choosed the other approach.... turned >>>>>>>> out to be a bad idea. >>>>>>>> >>>>>>>> alexey: i will prepare a patch later today, could you then please >>>>>>>> verify it fixes your problem. >>>>>>>> >>>>>>>> paolo: would we still need the madvise or is it enough to not write >>>>>>>> the zeroes? >>>>>>> It should be enough to not write them. >>>>>> Problem: checking the pages for zero allocates them. even at the >>>>>> source. >>>>> It doesn't look like. I tried this program and top doesn't show an >>>>> increasing amount of reserved memory: >>>>> >>>>> #include <stdio.h> >>>>> #include <stdlib.h> >>>>> int main() >>>>> { >>>>> char *x = malloc(500 << 20); >>>>> int i, j; >>>>> for (i = 0; i < 500; i += 10) { >>>>> for (j = 0; j < 10 << 20; j += 4096) { >>>>> *(volatile char*) (x + (i << 20) + j); >>>>> } >>>>> getchar(); >>>>> } >>>>> } >>>> strange. we are talking about RSS size, right? >>> None of the three top values change, and only VIRT is >500 MB. >>> >>>> is the malloc above using mmapped memory? >>> Yes. >>> >>>> which kernel version do you use? >>> 3.9. >>> >>>> what avoids allocating the memory for me is the following (with >>>> whatever side effects it has ;-)) >>> This would also fail to migrate any page that is swapped out, breaking >>> overcommit in a more subtle way. :) >>> >>> Paolo >> the following does also not allocate memory, but qemu does... >> > Hi, Peter > As the patch writes > > "not sending zero pages breaks migration if a page is zero > at the source but not at the destination." > > I don't understand why it would be trouble, shouldn't all page > not received in dest be treated as zero pages? How would the destination guest know if some page must be cleared? The previous patch (which Peter reverted) did not send anything for the pages which were zero on the source side. > Also, you mean following code is from qemu and it does not allocate > memory with you gcc right? Maybe it is related to KVM, how about > turn off KVM and retry following code in qemu? > >> #include <stdio.h> >> #include <stdlib.h> >> #include <assert.h> >> #include <unistd.h> >> #include <sys/resource.h> >> #include <inttypes.h> >> #include <string.h> >> #include <sys/mman.h> >> #include <errno.h> >> >> #if defined __SSE2__ >> #include <emmintrin.h> >> #define VECTYPE __m128i >> #define SPLAT(p) _mm_set1_epi8(*(p)) >> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == >> 0xFFFF) >> #else >> #define VECTYPE unsigned long >> #define SPLAT(p) (*(p) * (~0UL / 255)) >> #define ALL_EQ(v1, v2) ((v1) == (v2)) >> #endif >> >> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8 >> >> /* Round number down to multiple */ >> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m)) >> >> /* Round number up to multiple */ >> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m)) >> >> #define QEMU_VMALLOC_ALIGN (256 * 4096) >> >> /* alloc shared memory pages */ >> void *qemu_anon_ram_alloc(size_t size) >> { >> size_t align = QEMU_VMALLOC_ALIGN; >> size_t total = size + align - getpagesize(); >> void *ptr = mmap(0, total, PROT_READ | PROT_WRITE, >> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); >> size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr; >> >> if (ptr == MAP_FAILED) { >> fprintf(stderr, "Failed to allocate %zu B: %s\n", >> size, strerror(errno)); >> abort(); >> } >> >> ptr += offset; >> total -= offset; >> >> if (offset > 0) { >> munmap(ptr - offset, offset); >> } >> if (total > size) { >> munmap(ptr + size, total - size); >> } >> >> return ptr; >> } >> >> static inline int >> can_use_buffer_find_nonzero_offset(const void *buf, size_t len) >> { >> return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR >> * sizeof(VECTYPE)) == 0 >> && ((uintptr_t) buf) % sizeof(VECTYPE) == 0); >> } >> >> size_t buffer_find_nonzero_offset(const void *buf, size_t len) >> { >> const VECTYPE *p = buf; >> const VECTYPE zero = (VECTYPE){0}; >> size_t i; >> >> if (!len) { >> return 0; >> } >> >> assert(can_use_buffer_find_nonzero_offset(buf, len)); >> >> for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) { >> if (!ALL_EQ(p[i], zero)) { >> return i * sizeof(VECTYPE); >> } >> } >> >> for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; >> i < len / sizeof(VECTYPE); >> i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) { >> VECTYPE tmp0 = p[i + 0] | p[i + 1]; >> VECTYPE tmp1 = p[i + 2] | p[i + 3]; >> VECTYPE tmp2 = p[i + 4] | p[i + 5]; >> VECTYPE tmp3 = p[i + 6] | p[i + 7]; >> VECTYPE tmp01 = tmp0 | tmp1; >> VECTYPE tmp23 = tmp2 | tmp3; >> if (!ALL_EQ(tmp01 | tmp23, zero)) { >> break; >> } >> } >> >> return i * sizeof(VECTYPE); >> } >> >> int main() >> { >> //char *x = malloc(1024 << 20); >> char *x = qemu_anon_ram_alloc(1024 << 20); >> >> int i, j; >> int ret = 0; >> struct rusage rusage; >> for (i = 0; i < 500; i ++) { >> for (j = 0; j < 10 << 20; j += 4096) { >> ret += buffer_find_nonzero_offset((char*) (x + (i << 20) >> + j), 4096); >> } >> getrusage( RUSAGE_SELF, &rusage ); >> printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10), >> rusage.ru_maxrss); >> getchar(); >> } >> printf("%d zero pages\n", ret); >> } >> > > -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-08 8:30 ` Alexey Kardashevskiy @ 2013-06-09 2:16 ` Wenchao Xia 2013-06-09 2:34 ` Alexey Kardashevskiy 2013-06-09 2:53 ` Benjamin Herrenschmidt 0 siblings, 2 replies; 49+ messages in thread From: Wenchao Xia @ 2013-06-09 2:16 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: qemu-ppc, Paolo Bonzini, Peter Lieven, qemu-devel, David Gibson 于 2013-6-8 16:30, Alexey Kardashevskiy 写道: > On 06/08/2013 06:27 PM, Wenchao Xia wrote: >>> On 04.06.2013 16:40, Paolo Bonzini wrote: >>>> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>>>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>> You could also scan the page for nonzero values before writing it. >>>>>>>>> i had this in mind, but then choosed the other approach.... turned >>>>>>>>> out to be a bad idea. >>>>>>>>> >>>>>>>>> alexey: i will prepare a patch later today, could you then please >>>>>>>>> verify it fixes your problem. >>>>>>>>> >>>>>>>>> paolo: would we still need the madvise or is it enough to not write >>>>>>>>> the zeroes? >>>>>>>> It should be enough to not write them. >>>>>>> Problem: checking the pages for zero allocates them. even at the >>>>>>> source. >>>>>> It doesn't look like. I tried this program and top doesn't show an >>>>>> increasing amount of reserved memory: >>>>>> >>>>>> #include <stdio.h> >>>>>> #include <stdlib.h> >>>>>> int main() >>>>>> { >>>>>> char *x = malloc(500 << 20); >>>>>> int i, j; >>>>>> for (i = 0; i < 500; i += 10) { >>>>>> for (j = 0; j < 10 << 20; j += 4096) { >>>>>> *(volatile char*) (x + (i << 20) + j); >>>>>> } >>>>>> getchar(); >>>>>> } >>>>>> } >>>>> strange. we are talking about RSS size, right? >>>> None of the three top values change, and only VIRT is >500 MB. >>>> >>>>> is the malloc above using mmapped memory? >>>> Yes. >>>> >>>>> which kernel version do you use? >>>> 3.9. >>>> >>>>> what avoids allocating the memory for me is the following (with >>>>> whatever side effects it has ;-)) >>>> This would also fail to migrate any page that is swapped out, breaking >>>> overcommit in a more subtle way. :) >>>> >>>> Paolo >>> the following does also not allocate memory, but qemu does... >>> >> Hi, Peter >> As the patch writes >> >> "not sending zero pages breaks migration if a page is zero >> at the source but not at the destination." >> >> I don't understand why it would be trouble, shouldn't all page >> not received in dest be treated as zero pages? > > > How would the destination guest know if some page must be cleared? The > previous patch (which Peter reverted) did not send anything for the pages > which were zero on the source side. > > If an page was not received and destination knows that page should exist according to total size, fill it with zero at destination, would it solve the problem? > >> Also, you mean following code is from qemu and it does not allocate >> memory with you gcc right? Maybe it is related to KVM, how about >> turn off KVM and retry following code in qemu? >> >>> #include <stdio.h> >>> #include <stdlib.h> >>> #include <assert.h> >>> #include <unistd.h> >>> #include <sys/resource.h> >>> #include <inttypes.h> >>> #include <string.h> >>> #include <sys/mman.h> >>> #include <errno.h> >>> >>> #if defined __SSE2__ >>> #include <emmintrin.h> >>> #define VECTYPE __m128i >>> #define SPLAT(p) _mm_set1_epi8(*(p)) >>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == >>> 0xFFFF) >>> #else >>> #define VECTYPE unsigned long >>> #define SPLAT(p) (*(p) * (~0UL / 255)) >>> #define ALL_EQ(v1, v2) ((v1) == (v2)) >>> #endif >>> >>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8 >>> >>> /* Round number down to multiple */ >>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m)) >>> >>> /* Round number up to multiple */ >>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m)) >>> >>> #define QEMU_VMALLOC_ALIGN (256 * 4096) >>> >>> /* alloc shared memory pages */ >>> void *qemu_anon_ram_alloc(size_t size) >>> { >>> size_t align = QEMU_VMALLOC_ALIGN; >>> size_t total = size + align - getpagesize(); >>> void *ptr = mmap(0, total, PROT_READ | PROT_WRITE, >>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); >>> size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr; >>> >>> if (ptr == MAP_FAILED) { >>> fprintf(stderr, "Failed to allocate %zu B: %s\n", >>> size, strerror(errno)); >>> abort(); >>> } >>> >>> ptr += offset; >>> total -= offset; >>> >>> if (offset > 0) { >>> munmap(ptr - offset, offset); >>> } >>> if (total > size) { >>> munmap(ptr + size, total - size); >>> } >>> >>> return ptr; >>> } >>> >>> static inline int >>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len) >>> { >>> return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR >>> * sizeof(VECTYPE)) == 0 >>> && ((uintptr_t) buf) % sizeof(VECTYPE) == 0); >>> } >>> >>> size_t buffer_find_nonzero_offset(const void *buf, size_t len) >>> { >>> const VECTYPE *p = buf; >>> const VECTYPE zero = (VECTYPE){0}; >>> size_t i; >>> >>> if (!len) { >>> return 0; >>> } >>> >>> assert(can_use_buffer_find_nonzero_offset(buf, len)); >>> >>> for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) { >>> if (!ALL_EQ(p[i], zero)) { >>> return i * sizeof(VECTYPE); >>> } >>> } >>> >>> for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; >>> i < len / sizeof(VECTYPE); >>> i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) { >>> VECTYPE tmp0 = p[i + 0] | p[i + 1]; >>> VECTYPE tmp1 = p[i + 2] | p[i + 3]; >>> VECTYPE tmp2 = p[i + 4] | p[i + 5]; >>> VECTYPE tmp3 = p[i + 6] | p[i + 7]; >>> VECTYPE tmp01 = tmp0 | tmp1; >>> VECTYPE tmp23 = tmp2 | tmp3; >>> if (!ALL_EQ(tmp01 | tmp23, zero)) { >>> break; >>> } >>> } >>> >>> return i * sizeof(VECTYPE); >>> } >>> >>> int main() >>> { >>> //char *x = malloc(1024 << 20); >>> char *x = qemu_anon_ram_alloc(1024 << 20); >>> >>> int i, j; >>> int ret = 0; >>> struct rusage rusage; >>> for (i = 0; i < 500; i ++) { >>> for (j = 0; j < 10 << 20; j += 4096) { >>> ret += buffer_find_nonzero_offset((char*) (x + (i << 20) >>> + j), 4096); >>> } >>> getrusage( RUSAGE_SELF, &rusage ); >>> printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10), >>> rusage.ru_maxrss); >>> getchar(); >>> } >>> printf("%d zero pages\n", ret); >>> } >>> >> >> > > -- Best Regards Wenchao Xia ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-09 2:16 ` Wenchao Xia @ 2013-06-09 2:34 ` Alexey Kardashevskiy 2013-06-09 2:52 ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt 2013-06-09 3:01 ` [Qemu-devel] " Wenchao Xia 2013-06-09 2:53 ` Benjamin Herrenschmidt 1 sibling, 2 replies; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-06-09 2:34 UTC (permalink / raw) To: Wenchao Xia Cc: qemu-ppc, Paolo Bonzini, Peter Lieven, qemu-devel, David Gibson On 06/09/2013 12:16 PM, Wenchao Xia wrote: > 于 2013-6-8 16:30, Alexey Kardashevskiy 写道: >> On 06/08/2013 06:27 PM, Wenchao Xia wrote: >>>> On 04.06.2013 16:40, Paolo Bonzini wrote: >>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>> You could also scan the page for nonzero values before writing it. >>>>>>>>>> i had this in mind, but then choosed the other approach.... turned >>>>>>>>>> out to be a bad idea. >>>>>>>>>> >>>>>>>>>> alexey: i will prepare a patch later today, could you then please >>>>>>>>>> verify it fixes your problem. >>>>>>>>>> >>>>>>>>>> paolo: would we still need the madvise or is it enough to not write >>>>>>>>>> the zeroes? >>>>>>>>> It should be enough to not write them. >>>>>>>> Problem: checking the pages for zero allocates them. even at the >>>>>>>> source. >>>>>>> It doesn't look like. I tried this program and top doesn't show an >>>>>>> increasing amount of reserved memory: >>>>>>> >>>>>>> #include <stdio.h> >>>>>>> #include <stdlib.h> >>>>>>> int main() >>>>>>> { >>>>>>> char *x = malloc(500 << 20); >>>>>>> int i, j; >>>>>>> for (i = 0; i < 500; i += 10) { >>>>>>> for (j = 0; j < 10 << 20; j += 4096) { >>>>>>> *(volatile char*) (x + (i << 20) + j); >>>>>>> } >>>>>>> getchar(); >>>>>>> } >>>>>>> } >>>>>> strange. we are talking about RSS size, right? >>>>> None of the three top values change, and only VIRT is >500 MB. >>>>> >>>>>> is the malloc above using mmapped memory? >>>>> Yes. >>>>> >>>>>> which kernel version do you use? >>>>> 3.9. >>>>> >>>>>> what avoids allocating the memory for me is the following (with >>>>>> whatever side effects it has ;-)) >>>>> This would also fail to migrate any page that is swapped out, breaking >>>>> overcommit in a more subtle way. :) >>>>> >>>>> Paolo >>>> the following does also not allocate memory, but qemu does... >>>> >>> Hi, Peter >>> As the patch writes >>> >>> "not sending zero pages breaks migration if a page is zero >>> at the source but not at the destination." >>> >>> I don't understand why it would be trouble, shouldn't all page >>> not received in dest be treated as zero pages? >> >> >> How would the destination guest know if some page must be cleared? The >> previous patch (which Peter reverted) did not send anything for the pages >> which were zero on the source side. >> >> > If an page was not received and destination knows that page should > exist according to total size, fill it with zero at destination, would > it solve the problem? It is _live_ migration, the source sends changes, same pages can change and be sent several times. So we would need to turn tracking on on the destination to know if some page was received from the source or changed by the destination itself (by writing there bios/firmware images, etc) and then clear pages which were touched by the destination and were not sent by the source. Or we do not make guesses, the source sends everything and the destination simply checks if a page which is empty on the source is empty on the destination and avoid writing zeroes to it. Looks simpler to me and this is what the new patch does. > >> >>> Also, you mean following code is from qemu and it does not allocate >>> memory with you gcc right? Maybe it is related to KVM, how about >>> turn off KVM and retry following code in qemu? >>> >>>> #include <stdio.h> >>>> #include <stdlib.h> >>>> #include <assert.h> >>>> #include <unistd.h> >>>> #include <sys/resource.h> >>>> #include <inttypes.h> >>>> #include <string.h> >>>> #include <sys/mman.h> >>>> #include <errno.h> >>>> >>>> #if defined __SSE2__ >>>> #include <emmintrin.h> >>>> #define VECTYPE __m128i >>>> #define SPLAT(p) _mm_set1_epi8(*(p)) >>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == >>>> 0xFFFF) >>>> #else >>>> #define VECTYPE unsigned long >>>> #define SPLAT(p) (*(p) * (~0UL / 255)) >>>> #define ALL_EQ(v1, v2) ((v1) == (v2)) >>>> #endif >>>> >>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8 >>>> >>>> /* Round number down to multiple */ >>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m)) >>>> >>>> /* Round number up to multiple */ >>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m)) >>>> >>>> #define QEMU_VMALLOC_ALIGN (256 * 4096) >>>> >>>> /* alloc shared memory pages */ >>>> void *qemu_anon_ram_alloc(size_t size) >>>> { >>>> size_t align = QEMU_VMALLOC_ALIGN; >>>> size_t total = size + align - getpagesize(); >>>> void *ptr = mmap(0, total, PROT_READ | PROT_WRITE, >>>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); >>>> size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - >>>> (uintptr_t)ptr; >>>> >>>> if (ptr == MAP_FAILED) { >>>> fprintf(stderr, "Failed to allocate %zu B: %s\n", >>>> size, strerror(errno)); >>>> abort(); >>>> } >>>> >>>> ptr += offset; >>>> total -= offset; >>>> >>>> if (offset > 0) { >>>> munmap(ptr - offset, offset); >>>> } >>>> if (total > size) { >>>> munmap(ptr + size, total - size); >>>> } >>>> >>>> return ptr; >>>> } >>>> >>>> static inline int >>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len) >>>> { >>>> return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR >>>> * sizeof(VECTYPE)) == 0 >>>> && ((uintptr_t) buf) % sizeof(VECTYPE) == 0); >>>> } >>>> >>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len) >>>> { >>>> const VECTYPE *p = buf; >>>> const VECTYPE zero = (VECTYPE){0}; >>>> size_t i; >>>> >>>> if (!len) { >>>> return 0; >>>> } >>>> >>>> assert(can_use_buffer_find_nonzero_offset(buf, len)); >>>> >>>> for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) { >>>> if (!ALL_EQ(p[i], zero)) { >>>> return i * sizeof(VECTYPE); >>>> } >>>> } >>>> >>>> for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; >>>> i < len / sizeof(VECTYPE); >>>> i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) { >>>> VECTYPE tmp0 = p[i + 0] | p[i + 1]; >>>> VECTYPE tmp1 = p[i + 2] | p[i + 3]; >>>> VECTYPE tmp2 = p[i + 4] | p[i + 5]; >>>> VECTYPE tmp3 = p[i + 6] | p[i + 7]; >>>> VECTYPE tmp01 = tmp0 | tmp1; >>>> VECTYPE tmp23 = tmp2 | tmp3; >>>> if (!ALL_EQ(tmp01 | tmp23, zero)) { >>>> break; >>>> } >>>> } >>>> >>>> return i * sizeof(VECTYPE); >>>> } >>>> >>>> int main() >>>> { >>>> //char *x = malloc(1024 << 20); >>>> char *x = qemu_anon_ram_alloc(1024 << 20); >>>> >>>> int i, j; >>>> int ret = 0; >>>> struct rusage rusage; >>>> for (i = 0; i < 500; i ++) { >>>> for (j = 0; j < 10 << 20; j += 4096) { >>>> ret += buffer_find_nonzero_offset((char*) (x + (i << 20) >>>> + j), 4096); >>>> } >>>> getrusage( RUSAGE_SELF, &rusage ); >>>> printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10), >>>> rusage.ru_maxrss); >>>> getchar(); >>>> } >>>> printf("%d zero pages\n", ret); >>>> } >>>> >>> >>> >> >> > > -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [Qemu-ppc] broken incoming migration 2013-06-09 2:34 ` Alexey Kardashevskiy @ 2013-06-09 2:52 ` Benjamin Herrenschmidt 2013-06-09 3:01 ` Alexey Kardashevskiy 2013-06-09 3:01 ` [Qemu-devel] " Wenchao Xia 1 sibling, 1 reply; 49+ messages in thread From: Benjamin Herrenschmidt @ 2013-06-09 2:52 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Peter Lieven, qemu-devel, qemu-ppc, Paolo Bonzini, Wenchao Xia, David Gibson On Sun, 2013-06-09 at 12:34 +1000, Alexey Kardashevskiy wrote: > It is _live_ migration, the source sends changes, same pages can change and > be sent several times. So we would need to turn tracking on on the > destination to know if some page was received from the source or changed by > the destination itself (by writing there bios/firmware images, etc) and > then clear pages which were touched by the destination and were not sent by > the source. Or we can set some kind of flag so that when creating a "migration target" VM we don't load all these things into memory. > Or we do not make guesses, the source sends everything and the destination > simply checks if a page which is empty on the source is empty on the > destination and avoid writing zeroes to it. Looks simpler to me and this is > what the new patch does. But you end up sending a lot of zero's ... is the migration compressed (I am not familiar with it at all) ? If it is, that shouldn't be a big deal, but else it feels to me that you should be able to send a special packet instead that says "all zeros" because you'll potentially have an awful lot of these. Ben. > > > >> > >>> Also, you mean following code is from qemu and it does not allocate > >>> memory with you gcc right? Maybe it is related to KVM, how about > >>> turn off KVM and retry following code in qemu? > >>> > >>>> #include <stdio.h> > >>>> #include <stdlib.h> > >>>> #include <assert.h> > >>>> #include <unistd.h> > >>>> #include <sys/resource.h> > >>>> #include <inttypes.h> > >>>> #include <string.h> > >>>> #include <sys/mman.h> > >>>> #include <errno.h> > >>>> > >>>> #if defined __SSE2__ > >>>> #include <emmintrin.h> > >>>> #define VECTYPE __m128i > >>>> #define SPLAT(p) _mm_set1_epi8(*(p)) > >>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == > >>>> 0xFFFF) > >>>> #else > >>>> #define VECTYPE unsigned long > >>>> #define SPLAT(p) (*(p) * (~0UL / 255)) > >>>> #define ALL_EQ(v1, v2) ((v1) == (v2)) > >>>> #endif > >>>> > >>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8 > >>>> > >>>> /* Round number down to multiple */ > >>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m)) > >>>> > >>>> /* Round number up to multiple */ > >>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m)) > >>>> > >>>> #define QEMU_VMALLOC_ALIGN (256 * 4096) > >>>> > >>>> /* alloc shared memory pages */ > >>>> void *qemu_anon_ram_alloc(size_t size) > >>>> { > >>>> size_t align = QEMU_VMALLOC_ALIGN; > >>>> size_t total = size + align - getpagesize(); > >>>> void *ptr = mmap(0, total, PROT_READ | PROT_WRITE, > >>>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); > >>>> size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - > >>>> (uintptr_t)ptr; > >>>> > >>>> if (ptr == MAP_FAILED) { > >>>> fprintf(stderr, "Failed to allocate %zu B: %s\n", > >>>> size, strerror(errno)); > >>>> abort(); > >>>> } > >>>> > >>>> ptr += offset; > >>>> total -= offset; > >>>> > >>>> if (offset > 0) { > >>>> munmap(ptr - offset, offset); > >>>> } > >>>> if (total > size) { > >>>> munmap(ptr + size, total - size); > >>>> } > >>>> > >>>> return ptr; > >>>> } > >>>> > >>>> static inline int > >>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len) > >>>> { > >>>> return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR > >>>> * sizeof(VECTYPE)) == 0 > >>>> && ((uintptr_t) buf) % sizeof(VECTYPE) == 0); > >>>> } > >>>> > >>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len) > >>>> { > >>>> const VECTYPE *p = buf; > >>>> const VECTYPE zero = (VECTYPE){0}; > >>>> size_t i; > >>>> > >>>> if (!len) { > >>>> return 0; > >>>> } > >>>> > >>>> assert(can_use_buffer_find_nonzero_offset(buf, len)); > >>>> > >>>> for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) { > >>>> if (!ALL_EQ(p[i], zero)) { > >>>> return i * sizeof(VECTYPE); > >>>> } > >>>> } > >>>> > >>>> for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; > >>>> i < len / sizeof(VECTYPE); > >>>> i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) { > >>>> VECTYPE tmp0 = p[i + 0] | p[i + 1]; > >>>> VECTYPE tmp1 = p[i + 2] | p[i + 3]; > >>>> VECTYPE tmp2 = p[i + 4] | p[i + 5]; > >>>> VECTYPE tmp3 = p[i + 6] | p[i + 7]; > >>>> VECTYPE tmp01 = tmp0 | tmp1; > >>>> VECTYPE tmp23 = tmp2 | tmp3; > >>>> if (!ALL_EQ(tmp01 | tmp23, zero)) { > >>>> break; > >>>> } > >>>> } > >>>> > >>>> return i * sizeof(VECTYPE); > >>>> } > >>>> > >>>> int main() > >>>> { > >>>> //char *x = malloc(1024 << 20); > >>>> char *x = qemu_anon_ram_alloc(1024 << 20); > >>>> > >>>> int i, j; > >>>> int ret = 0; > >>>> struct rusage rusage; > >>>> for (i = 0; i < 500; i ++) { > >>>> for (j = 0; j < 10 << 20; j += 4096) { > >>>> ret += buffer_find_nonzero_offset((char*) (x + (i << 20) > >>>> + j), 4096); > >>>> } > >>>> getrusage( RUSAGE_SELF, &rusage ); > >>>> printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10), > >>>> rusage.ru_maxrss); > >>>> getchar(); > >>>> } > >>>> printf("%d zero pages\n", ret); > >>>> } > >>>> > >>> > >>> > >> > >> > > > > > > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [Qemu-ppc] broken incoming migration 2013-06-09 2:52 ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt @ 2013-06-09 3:01 ` Alexey Kardashevskiy 0 siblings, 0 replies; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-06-09 3:01 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Peter Lieven, qemu-devel, qemu-ppc, Paolo Bonzini, Wenchao Xia, David Gibson On 06/09/2013 12:52 PM, Benjamin Herrenschmidt wrote: > On Sun, 2013-06-09 at 12:34 +1000, Alexey Kardashevskiy wrote: > >> It is _live_ migration, the source sends changes, same pages can change and >> be sent several times. So we would need to turn tracking on on the >> destination to know if some page was received from the source or changed by >> the destination itself (by writing there bios/firmware images, etc) and >> then clear pages which were touched by the destination and were not sent by >> the source. > > Or we can set some kind of flag so that when creating a "migration > target" VM we don't load all these things into memory. How would we do that? The platform initialization code does not have a clue whether it is going to receive a migrated host or not. >> Or we do not make guesses, the source sends everything and the destination >> simply checks if a page which is empty on the source is empty on the >> destination and avoid writing zeroes to it. Looks simpler to me and this is >> what the new patch does. > > But you end up sending a lot of zero's ... is the migration compressed > (I am not familiar with it at all) ? If it is, that shouldn't be a big > deal, but else it feels to me that you should be able to send a special > packet instead that says "all zeros" because you'll potentially have an > awful lot of these. It is compressed exactly as you described.. > > Ben. > >>> >>>> >>>>> Also, you mean following code is from qemu and it does not allocate >>>>> memory with you gcc right? Maybe it is related to KVM, how about >>>>> turn off KVM and retry following code in qemu? >>>>> >>>>>> #include <stdio.h> >>>>>> #include <stdlib.h> >>>>>> #include <assert.h> >>>>>> #include <unistd.h> >>>>>> #include <sys/resource.h> >>>>>> #include <inttypes.h> >>>>>> #include <string.h> >>>>>> #include <sys/mman.h> >>>>>> #include <errno.h> >>>>>> >>>>>> #if defined __SSE2__ >>>>>> #include <emmintrin.h> >>>>>> #define VECTYPE __m128i >>>>>> #define SPLAT(p) _mm_set1_epi8(*(p)) >>>>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == >>>>>> 0xFFFF) >>>>>> #else >>>>>> #define VECTYPE unsigned long >>>>>> #define SPLAT(p) (*(p) * (~0UL / 255)) >>>>>> #define ALL_EQ(v1, v2) ((v1) == (v2)) >>>>>> #endif >>>>>> >>>>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8 >>>>>> >>>>>> /* Round number down to multiple */ >>>>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m)) >>>>>> >>>>>> /* Round number up to multiple */ >>>>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m)) >>>>>> >>>>>> #define QEMU_VMALLOC_ALIGN (256 * 4096) >>>>>> >>>>>> /* alloc shared memory pages */ >>>>>> void *qemu_anon_ram_alloc(size_t size) >>>>>> { >>>>>> size_t align = QEMU_VMALLOC_ALIGN; >>>>>> size_t total = size + align - getpagesize(); >>>>>> void *ptr = mmap(0, total, PROT_READ | PROT_WRITE, >>>>>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); >>>>>> size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - >>>>>> (uintptr_t)ptr; >>>>>> >>>>>> if (ptr == MAP_FAILED) { >>>>>> fprintf(stderr, "Failed to allocate %zu B: %s\n", >>>>>> size, strerror(errno)); >>>>>> abort(); >>>>>> } >>>>>> >>>>>> ptr += offset; >>>>>> total -= offset; >>>>>> >>>>>> if (offset > 0) { >>>>>> munmap(ptr - offset, offset); >>>>>> } >>>>>> if (total > size) { >>>>>> munmap(ptr + size, total - size); >>>>>> } >>>>>> >>>>>> return ptr; >>>>>> } >>>>>> >>>>>> static inline int >>>>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len) >>>>>> { >>>>>> return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR >>>>>> * sizeof(VECTYPE)) == 0 >>>>>> && ((uintptr_t) buf) % sizeof(VECTYPE) == 0); >>>>>> } >>>>>> >>>>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len) >>>>>> { >>>>>> const VECTYPE *p = buf; >>>>>> const VECTYPE zero = (VECTYPE){0}; >>>>>> size_t i; >>>>>> >>>>>> if (!len) { >>>>>> return 0; >>>>>> } >>>>>> >>>>>> assert(can_use_buffer_find_nonzero_offset(buf, len)); >>>>>> >>>>>> for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) { >>>>>> if (!ALL_EQ(p[i], zero)) { >>>>>> return i * sizeof(VECTYPE); >>>>>> } >>>>>> } >>>>>> >>>>>> for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; >>>>>> i < len / sizeof(VECTYPE); >>>>>> i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) { >>>>>> VECTYPE tmp0 = p[i + 0] | p[i + 1]; >>>>>> VECTYPE tmp1 = p[i + 2] | p[i + 3]; >>>>>> VECTYPE tmp2 = p[i + 4] | p[i + 5]; >>>>>> VECTYPE tmp3 = p[i + 6] | p[i + 7]; >>>>>> VECTYPE tmp01 = tmp0 | tmp1; >>>>>> VECTYPE tmp23 = tmp2 | tmp3; >>>>>> if (!ALL_EQ(tmp01 | tmp23, zero)) { >>>>>> break; >>>>>> } >>>>>> } >>>>>> >>>>>> return i * sizeof(VECTYPE); >>>>>> } >>>>>> >>>>>> int main() >>>>>> { >>>>>> //char *x = malloc(1024 << 20); >>>>>> char *x = qemu_anon_ram_alloc(1024 << 20); >>>>>> >>>>>> int i, j; >>>>>> int ret = 0; >>>>>> struct rusage rusage; >>>>>> for (i = 0; i < 500; i ++) { >>>>>> for (j = 0; j < 10 << 20; j += 4096) { >>>>>> ret += buffer_find_nonzero_offset((char*) (x + (i << 20) >>>>>> + j), 4096); >>>>>> } >>>>>> getrusage( RUSAGE_SELF, &rusage ); >>>>>> printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10), >>>>>> rusage.ru_maxrss); >>>>>> getchar(); >>>>>> } >>>>>> printf("%d zero pages\n", ret); >>>>>> } >>>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> >> > > -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-09 2:34 ` Alexey Kardashevskiy 2013-06-09 2:52 ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt @ 2013-06-09 3:01 ` Wenchao Xia 2013-06-09 3:09 ` Alexey Kardashevskiy 1 sibling, 1 reply; 49+ messages in thread From: Wenchao Xia @ 2013-06-09 3:01 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: qemu-ppc, Paolo Bonzini, Peter Lieven, qemu-devel, David Gibson 于 2013-6-9 10:34, Alexey Kardashevskiy 写道: > On 06/09/2013 12:16 PM, Wenchao Xia wrote: >> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道: >>> On 06/08/2013 06:27 PM, Wenchao Xia wrote: >>>>> On 04.06.2013 16:40, Paolo Bonzini wrote: >>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>>> You could also scan the page for nonzero values before writing it. >>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned >>>>>>>>>>> out to be a bad idea. >>>>>>>>>>> >>>>>>>>>>> alexey: i will prepare a patch later today, could you then please >>>>>>>>>>> verify it fixes your problem. >>>>>>>>>>> >>>>>>>>>>> paolo: would we still need the madvise or is it enough to not write >>>>>>>>>>> the zeroes? >>>>>>>>>> It should be enough to not write them. >>>>>>>>> Problem: checking the pages for zero allocates them. even at the >>>>>>>>> source. >>>>>>>> It doesn't look like. I tried this program and top doesn't show an >>>>>>>> increasing amount of reserved memory: >>>>>>>> >>>>>>>> #include <stdio.h> >>>>>>>> #include <stdlib.h> >>>>>>>> int main() >>>>>>>> { >>>>>>>> char *x = malloc(500 << 20); >>>>>>>> int i, j; >>>>>>>> for (i = 0; i < 500; i += 10) { >>>>>>>> for (j = 0; j < 10 << 20; j += 4096) { >>>>>>>> *(volatile char*) (x + (i << 20) + j); >>>>>>>> } >>>>>>>> getchar(); >>>>>>>> } >>>>>>>> } >>>>>>> strange. we are talking about RSS size, right? >>>>>> None of the three top values change, and only VIRT is >500 MB. >>>>>> >>>>>>> is the malloc above using mmapped memory? >>>>>> Yes. >>>>>> >>>>>>> which kernel version do you use? >>>>>> 3.9. >>>>>> >>>>>>> what avoids allocating the memory for me is the following (with >>>>>>> whatever side effects it has ;-)) >>>>>> This would also fail to migrate any page that is swapped out, breaking >>>>>> overcommit in a more subtle way. :) >>>>>> >>>>>> Paolo >>>>> the following does also not allocate memory, but qemu does... >>>>> >>>> Hi, Peter >>>> As the patch writes >>>> >>>> "not sending zero pages breaks migration if a page is zero >>>> at the source but not at the destination." >>>> >>>> I don't understand why it would be trouble, shouldn't all page >>>> not received in dest be treated as zero pages? >>> >>> >>> How would the destination guest know if some page must be cleared? The >>> previous patch (which Peter reverted) did not send anything for the pages >>> which were zero on the source side. >>> >>> >> If an page was not received and destination knows that page should >> exist according to total size, fill it with zero at destination, would >> it solve the problem? > > It is _live_ migration, the source sends changes, same pages can change and > be sent several times. So we would need to turn tracking on on the > destination to know if some page was received from the source or changed by > the destination itself (by writing there bios/firmware images, etc) and > then clear pages which were touched by the destination and were not sent by > the source. OK, I can understand the problem is, for example: Destination boots up with 0x0000-0xFFFF filled with bios image. Source forgot to send zero pages in 0x0000-0xFFFF. After migration destination got 0x0000-0xFFFF dirty(different with source) Thanks for explain. This seems refer to the migration protocol: how should the guest treat unsent pages. The patch causing the problem, actually treat zero pages as "not to sent" at source, but another half is missing: treat "not received" as zero pages at destination. I guess if second half is added, problem is gone: after page transfer completed, before destination resume, fill zero in "not received" pages. > > Or we do not make guesses, the source sends everything and the destination > simply checks if a page which is empty on the source is empty on the > destination and avoid writing zeroes to it. Looks simpler to me and this is > what the new patch does. > > >> >>> >>>> Also, you mean following code is from qemu and it does not allocate >>>> memory with you gcc right? Maybe it is related to KVM, how about >>>> turn off KVM and retry following code in qemu? >>>> >>>>> #include <stdio.h> >>>>> #include <stdlib.h> >>>>> #include <assert.h> >>>>> #include <unistd.h> >>>>> #include <sys/resource.h> >>>>> #include <inttypes.h> >>>>> #include <string.h> >>>>> #include <sys/mman.h> >>>>> #include <errno.h> >>>>> >>>>> #if defined __SSE2__ >>>>> #include <emmintrin.h> >>>>> #define VECTYPE __m128i >>>>> #define SPLAT(p) _mm_set1_epi8(*(p)) >>>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == >>>>> 0xFFFF) >>>>> #else >>>>> #define VECTYPE unsigned long >>>>> #define SPLAT(p) (*(p) * (~0UL / 255)) >>>>> #define ALL_EQ(v1, v2) ((v1) == (v2)) >>>>> #endif >>>>> >>>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8 >>>>> >>>>> /* Round number down to multiple */ >>>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m)) >>>>> >>>>> /* Round number up to multiple */ >>>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m)) >>>>> >>>>> #define QEMU_VMALLOC_ALIGN (256 * 4096) >>>>> >>>>> /* alloc shared memory pages */ >>>>> void *qemu_anon_ram_alloc(size_t size) >>>>> { >>>>> size_t align = QEMU_VMALLOC_ALIGN; >>>>> size_t total = size + align - getpagesize(); >>>>> void *ptr = mmap(0, total, PROT_READ | PROT_WRITE, >>>>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); >>>>> size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - >>>>> (uintptr_t)ptr; >>>>> >>>>> if (ptr == MAP_FAILED) { >>>>> fprintf(stderr, "Failed to allocate %zu B: %s\n", >>>>> size, strerror(errno)); >>>>> abort(); >>>>> } >>>>> >>>>> ptr += offset; >>>>> total -= offset; >>>>> >>>>> if (offset > 0) { >>>>> munmap(ptr - offset, offset); >>>>> } >>>>> if (total > size) { >>>>> munmap(ptr + size, total - size); >>>>> } >>>>> >>>>> return ptr; >>>>> } >>>>> >>>>> static inline int >>>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len) >>>>> { >>>>> return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR >>>>> * sizeof(VECTYPE)) == 0 >>>>> && ((uintptr_t) buf) % sizeof(VECTYPE) == 0); >>>>> } >>>>> >>>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len) >>>>> { >>>>> const VECTYPE *p = buf; >>>>> const VECTYPE zero = (VECTYPE){0}; >>>>> size_t i; >>>>> >>>>> if (!len) { >>>>> return 0; >>>>> } >>>>> >>>>> assert(can_use_buffer_find_nonzero_offset(buf, len)); >>>>> >>>>> for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) { >>>>> if (!ALL_EQ(p[i], zero)) { >>>>> return i * sizeof(VECTYPE); >>>>> } >>>>> } >>>>> >>>>> for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; >>>>> i < len / sizeof(VECTYPE); >>>>> i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) { >>>>> VECTYPE tmp0 = p[i + 0] | p[i + 1]; >>>>> VECTYPE tmp1 = p[i + 2] | p[i + 3]; >>>>> VECTYPE tmp2 = p[i + 4] | p[i + 5]; >>>>> VECTYPE tmp3 = p[i + 6] | p[i + 7]; >>>>> VECTYPE tmp01 = tmp0 | tmp1; >>>>> VECTYPE tmp23 = tmp2 | tmp3; >>>>> if (!ALL_EQ(tmp01 | tmp23, zero)) { >>>>> break; >>>>> } >>>>> } >>>>> >>>>> return i * sizeof(VECTYPE); >>>>> } >>>>> >>>>> int main() >>>>> { >>>>> //char *x = malloc(1024 << 20); >>>>> char *x = qemu_anon_ram_alloc(1024 << 20); >>>>> >>>>> int i, j; >>>>> int ret = 0; >>>>> struct rusage rusage; >>>>> for (i = 0; i < 500; i ++) { >>>>> for (j = 0; j < 10 << 20; j += 4096) { >>>>> ret += buffer_find_nonzero_offset((char*) (x + (i << 20) >>>>> + j), 4096); >>>>> } >>>>> getrusage( RUSAGE_SELF, &rusage ); >>>>> printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10), >>>>> rusage.ru_maxrss); >>>>> getchar(); >>>>> } >>>>> printf("%d zero pages\n", ret); >>>>> } >>>>> >>>> >>>> >>> >>> >> >> > > -- Best Regards Wenchao Xia ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-09 3:01 ` [Qemu-devel] " Wenchao Xia @ 2013-06-09 3:09 ` Alexey Kardashevskiy 2013-06-09 3:31 ` Wenchao Xia 2013-06-09 7:27 ` Peter Lieven 0 siblings, 2 replies; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-06-09 3:09 UTC (permalink / raw) To: Wenchao Xia Cc: qemu-ppc, Paolo Bonzini, Peter Lieven, qemu-devel, David Gibson On 06/09/2013 01:01 PM, Wenchao Xia wrote: > 于 2013-6-9 10:34, Alexey Kardashevskiy 写道: >> On 06/09/2013 12:16 PM, Wenchao Xia wrote: >>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道: >>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote: >>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote: >>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>>>> You could also scan the page for nonzero values before >>>>>>>>>>>>>> writing it. >>>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned >>>>>>>>>>>> out to be a bad idea. >>>>>>>>>>>> >>>>>>>>>>>> alexey: i will prepare a patch later today, could you then please >>>>>>>>>>>> verify it fixes your problem. >>>>>>>>>>>> >>>>>>>>>>>> paolo: would we still need the madvise or is it enough to not >>>>>>>>>>>> write >>>>>>>>>>>> the zeroes? >>>>>>>>>>> It should be enough to not write them. >>>>>>>>>> Problem: checking the pages for zero allocates them. even at the >>>>>>>>>> source. >>>>>>>>> It doesn't look like. I tried this program and top doesn't show an >>>>>>>>> increasing amount of reserved memory: >>>>>>>>> >>>>>>>>> #include <stdio.h> >>>>>>>>> #include <stdlib.h> >>>>>>>>> int main() >>>>>>>>> { >>>>>>>>> char *x = malloc(500 << 20); >>>>>>>>> int i, j; >>>>>>>>> for (i = 0; i < 500; i += 10) { >>>>>>>>> for (j = 0; j < 10 << 20; j += 4096) { >>>>>>>>> *(volatile char*) (x + (i << 20) + j); >>>>>>>>> } >>>>>>>>> getchar(); >>>>>>>>> } >>>>>>>>> } >>>>>>>> strange. we are talking about RSS size, right? >>>>>>> None of the three top values change, and only VIRT is >500 MB. >>>>>>> >>>>>>>> is the malloc above using mmapped memory? >>>>>>> Yes. >>>>>>> >>>>>>>> which kernel version do you use? >>>>>>> 3.9. >>>>>>> >>>>>>>> what avoids allocating the memory for me is the following (with >>>>>>>> whatever side effects it has ;-)) >>>>>>> This would also fail to migrate any page that is swapped out, breaking >>>>>>> overcommit in a more subtle way. :) >>>>>>> >>>>>>> Paolo >>>>>> the following does also not allocate memory, but qemu does... >>>>>> >>>>> Hi, Peter >>>>> As the patch writes >>>>> >>>>> "not sending zero pages breaks migration if a page is zero >>>>> at the source but not at the destination." >>>>> >>>>> I don't understand why it would be trouble, shouldn't all page >>>>> not received in dest be treated as zero pages? >>>> >>>> >>>> How would the destination guest know if some page must be cleared? The >>>> previous patch (which Peter reverted) did not send anything for the pages >>>> which were zero on the source side. >>>> >>>> >>> If an page was not received and destination knows that page should >>> exist according to total size, fill it with zero at destination, would >>> it solve the problem? >> >> It is _live_ migration, the source sends changes, same pages can change and >> be sent several times. So we would need to turn tracking on on the >> destination to know if some page was received from the source or changed by >> the destination itself (by writing there bios/firmware images, etc) and >> then clear pages which were touched by the destination and were not sent by >> the source. > OK, I can understand the problem is, for example: > Destination boots up with 0x0000-0xFFFF filled with bios image. > Source forgot to send zero pages in 0x0000-0xFFFF. The source did not forget, instead it zeroed these pages during its life and thought that they must be zeroed at the destination already (as the destination did not start and did not have a chance to write something there). > After migration destination got 0x0000-0xFFFF dirty(different with > source) Yep. And those pages were empty on the source what made debugging very easy :) > Thanks for explain. > > This seems refer to the migration protocol: how should the guest treat > unsent pages. The patch causing the problem, actually treat zero pages > as "not to sent" at source, but another half is missing: treat "not > received" as zero pages at destination. I guess if second half is added, > problem is gone: > after page transfer completed, before destination resume, > fill zero in "not received" pages. Make a working patch, we'll discuss it :) I do not see much acceleration coming from there. -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-09 3:09 ` Alexey Kardashevskiy @ 2013-06-09 3:31 ` Wenchao Xia 2013-06-09 7:27 ` Peter Lieven 1 sibling, 0 replies; 49+ messages in thread From: Wenchao Xia @ 2013-06-09 3:31 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: qemu-ppc, Paolo Bonzini, Peter Lieven, qemu-devel, David Gibson 于 2013-6-9 11:09, Alexey Kardashevskiy 写道: > On 06/09/2013 01:01 PM, Wenchao Xia wrote: >> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道: >>> On 06/09/2013 12:16 PM, Wenchao Xia wrote: >>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道: >>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote: >>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote: >>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>>>>> You could also scan the page for nonzero values before >>>>>>>>>>>>>>> writing it. >>>>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned >>>>>>>>>>>>> out to be a bad idea. >>>>>>>>>>>>> >>>>>>>>>>>>> alexey: i will prepare a patch later today, could you then please >>>>>>>>>>>>> verify it fixes your problem. >>>>>>>>>>>>> >>>>>>>>>>>>> paolo: would we still need the madvise or is it enough to not >>>>>>>>>>>>> write >>>>>>>>>>>>> the zeroes? >>>>>>>>>>>> It should be enough to not write them. >>>>>>>>>>> Problem: checking the pages for zero allocates them. even at the >>>>>>>>>>> source. >>>>>>>>>> It doesn't look like. I tried this program and top doesn't show an >>>>>>>>>> increasing amount of reserved memory: >>>>>>>>>> >>>>>>>>>> #include <stdio.h> >>>>>>>>>> #include <stdlib.h> >>>>>>>>>> int main() >>>>>>>>>> { >>>>>>>>>> char *x = malloc(500 << 20); >>>>>>>>>> int i, j; >>>>>>>>>> for (i = 0; i < 500; i += 10) { >>>>>>>>>> for (j = 0; j < 10 << 20; j += 4096) { >>>>>>>>>> *(volatile char*) (x + (i << 20) + j); >>>>>>>>>> } >>>>>>>>>> getchar(); >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>> strange. we are talking about RSS size, right? >>>>>>>> None of the three top values change, and only VIRT is >500 MB. >>>>>>>> >>>>>>>>> is the malloc above using mmapped memory? >>>>>>>> Yes. >>>>>>>> >>>>>>>>> which kernel version do you use? >>>>>>>> 3.9. >>>>>>>> >>>>>>>>> what avoids allocating the memory for me is the following (with >>>>>>>>> whatever side effects it has ;-)) >>>>>>>> This would also fail to migrate any page that is swapped out, breaking >>>>>>>> overcommit in a more subtle way. :) >>>>>>>> >>>>>>>> Paolo >>>>>>> the following does also not allocate memory, but qemu does... >>>>>>> >>>>>> Hi, Peter >>>>>> As the patch writes >>>>>> >>>>>> "not sending zero pages breaks migration if a page is zero >>>>>> at the source but not at the destination." >>>>>> >>>>>> I don't understand why it would be trouble, shouldn't all page >>>>>> not received in dest be treated as zero pages? >>>>> >>>>> >>>>> How would the destination guest know if some page must be cleared? The >>>>> previous patch (which Peter reverted) did not send anything for the pages >>>>> which were zero on the source side. >>>>> >>>>> >>>> If an page was not received and destination knows that page should >>>> exist according to total size, fill it with zero at destination, would >>>> it solve the problem? >>> >>> It is _live_ migration, the source sends changes, same pages can change and >>> be sent several times. So we would need to turn tracking on on the >>> destination to know if some page was received from the source or changed by >>> the destination itself (by writing there bios/firmware images, etc) and >>> then clear pages which were touched by the destination and were not sent by >>> the source. >> OK, I can understand the problem is, for example: >> Destination boots up with 0x0000-0xFFFF filled with bios image. >> Source forgot to send zero pages in 0x0000-0xFFFF. > > > The source did not forget, instead it zeroed these pages during its life > and thought that they must be zeroed at the destination already (as the > destination did not start and did not have a chance to write something there). > > >> After migration destination got 0x0000-0xFFFF dirty(different with >> source) > > Yep. And those pages were empty on the source what made debugging very easy :) > > >> Thanks for explain. >> >> This seems refer to the migration protocol: how should the guest treat >> unsent pages. The patch causing the problem, actually treat zero pages >> as "not to sent" at source, but another half is missing: treat "not >> received" as zero pages at destination. I guess if second half is added, >> problem is gone: >> after page transfer completed, before destination resume, >> fill zero in "not received" pages. > > > > Make a working patch, we'll discuss it :) I do not see much acceleration > coming from there. > > 4k zero page is compressed into header: 8 bytes flag + 1 byte tail + ( 1 + strlen(idstr) when ramblock is a new one), so take 10 bytes as average, ram:network flow is 4000:10 = 400:1 Then for a typical 4GB guest, sending the zero pages will take about 10M network flow, indeed not much acceleration. I think current method is already good enough, unless there are other benefits in not sending zero pages. -- Best Regards Wenchao Xia ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-09 3:09 ` Alexey Kardashevskiy 2013-06-09 3:31 ` Wenchao Xia @ 2013-06-09 7:27 ` Peter Lieven 2013-06-10 6:39 ` Alexey Kardashevskiy 1 sibling, 1 reply; 49+ messages in thread From: Peter Lieven @ 2013-06-09 7:27 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Paolo Bonzini, David Gibson, qemu-ppc, Wenchao Xia, qemu-devel Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>: > On 06/09/2013 01:01 PM, Wenchao Xia wrote: >> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道: >>> On 06/09/2013 12:16 PM, Wenchao Xia wrote: >>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道: >>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote: >>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote: >>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>>>>> You could also scan the page for nonzero values before >>>>>>>>>>>>>>> writing it. >>>>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned >>>>>>>>>>>>> out to be a bad idea. >>>>>>>>>>>>> >>>>>>>>>>>>> alexey: i will prepare a patch later today, could you then please >>>>>>>>>>>>> verify it fixes your problem. >>>>>>>>>>>>> >>>>>>>>>>>>> paolo: would we still need the madvise or is it enough to not >>>>>>>>>>>>> write >>>>>>>>>>>>> the zeroes? >>>>>>>>>>>> It should be enough to not write them. >>>>>>>>>>> Problem: checking the pages for zero allocates them. even at the >>>>>>>>>>> source. >>>>>>>>>> It doesn't look like. I tried this program and top doesn't show an >>>>>>>>>> increasing amount of reserved memory: >>>>>>>>>> >>>>>>>>>> #include <stdio.h> >>>>>>>>>> #include <stdlib.h> >>>>>>>>>> int main() >>>>>>>>>> { >>>>>>>>>> char *x = malloc(500 << 20); >>>>>>>>>> int i, j; >>>>>>>>>> for (i = 0; i < 500; i += 10) { >>>>>>>>>> for (j = 0; j < 10 << 20; j += 4096) { >>>>>>>>>> *(volatile char*) (x + (i << 20) + j); >>>>>>>>>> } >>>>>>>>>> getchar(); >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>> strange. we are talking about RSS size, right? >>>>>>>> None of the three top values change, and only VIRT is >500 MB. >>>>>>>> >>>>>>>>> is the malloc above using mmapped memory? >>>>>>>> Yes. >>>>>>>> >>>>>>>>> which kernel version do you use? >>>>>>>> 3.9. >>>>>>>> >>>>>>>>> what avoids allocating the memory for me is the following (with >>>>>>>>> whatever side effects it has ;-)) >>>>>>>> This would also fail to migrate any page that is swapped out, breaking >>>>>>>> overcommit in a more subtle way. :) >>>>>>>> >>>>>>>> Paolo >>>>>>> the following does also not allocate memory, but qemu does... >>>>>> Hi, Peter >>>>>> As the patch writes >>>>>> >>>>>> "not sending zero pages breaks migration if a page is zero >>>>>> at the source but not at the destination." >>>>>> >>>>>> I don't understand why it would be trouble, shouldn't all page >>>>>> not received in dest be treated as zero pages? >>>>> >>>>> >>>>> How would the destination guest know if some page must be cleared? The >>>>> previous patch (which Peter reverted) did not send anything for the pages >>>>> which were zero on the source side. >>>> If an page was not received and destination knows that page should >>>> exist according to total size, fill it with zero at destination, would >>>> it solve the problem? >>> >>> It is _live_ migration, the source sends changes, same pages can change and >>> be sent several times. So we would need to turn tracking on on the >>> destination to know if some page was received from the source or changed by >>> the destination itself (by writing there bios/firmware images, etc) and >>> then clear pages which were touched by the destination and were not sent by >>> the source. >> OK, I can understand the problem is, for example: >> Destination boots up with 0x0000-0xFFFF filled with bios image. >> Source forgot to send zero pages in 0x0000-0xFFFF. > > > The source did not forget, instead it zeroed these pages during its life > and thought that they must be zeroed at the destination already (as the > destination did not start and did not have a chance to write something there). > > >> After migration destination got 0x0000-0xFFFF dirty(different with >> source) > > Yep. And those pages were empty on the source what made debugging very easy :) > > >> Thanks for explain. >> >> This seems refer to the migration protocol: how should the guest treat >> unsent pages. The patch causing the problem, actually treat zero pages >> as "not to sent" at source, but another half is missing: treat "not >> received" as zero pages at destination. I guess if second half is added, >> problem is gone: >> after page transfer completed, before destination resume, >> fill zero in "not received" pages. > > > > Make a working patch, we'll discuss it :) I do not see much acceleration > coming from there. I would also not spent much time with this. I would either look to find an easy way to fix the initialization code to not unneccessarily load data into RAM or i will sent a v2 of my patch following Eric's concerns. Peter > > > -- > Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-09 7:27 ` Peter Lieven @ 2013-06-10 6:39 ` Alexey Kardashevskiy 2013-06-10 6:50 ` Peter Lieven 0 siblings, 1 reply; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-06-10 6:39 UTC (permalink / raw) To: Peter Lieven Cc: Paolo Bonzini, David Gibson, qemu-ppc, Wenchao Xia, qemu-devel On 06/09/2013 05:27 PM, Peter Lieven wrote: > > Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>: > >> On 06/09/2013 01:01 PM, Wenchao Xia wrote: >>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道: >>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote: >>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道: >>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote: >>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote: >>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>>>>>> You could also scan the page for nonzero >>>>>>>>>>>>>>>> values before writing it. >>>>>>>>>>>>>> i had this in mind, but then choosed the other >>>>>>>>>>>>>> approach.... turned out to be a bad idea. >>>>>>>>>>>>>> >>>>>>>>>>>>>> alexey: i will prepare a patch later today, >>>>>>>>>>>>>> could you then please verify it fixes your >>>>>>>>>>>>>> problem. >>>>>>>>>>>>>> >>>>>>>>>>>>>> paolo: would we still need the madvise or is >>>>>>>>>>>>>> it enough to not write the zeroes? >>>>>>>>>>>>> It should be enough to not write them. >>>>>>>>>>>> Problem: checking the pages for zero allocates >>>>>>>>>>>> them. even at the source. >>>>>>>>>>> It doesn't look like. I tried this program and top >>>>>>>>>>> doesn't show an increasing amount of reserved >>>>>>>>>>> memory: >>>>>>>>>>> >>>>>>>>>>> #include <stdio.h> #include <stdlib.h> int main() { >>>>>>>>>>> char *x = malloc(500 << 20); int i, j; for (i = 0; i >>>>>>>>>>> < 500; i += 10) { for (j = 0; j < 10 << 20; j += >>>>>>>>>>> 4096) { *(volatile char*) (x + (i << 20) + j); } >>>>>>>>>>> getchar(); } } >>>>>>>>>> strange. we are talking about RSS size, right? >>>>>>>>> None of the three top values change, and only VIRT is >>>>>>>>> >500 MB. >>>>>>>>> >>>>>>>>>> is the malloc above using mmapped memory? >>>>>>>>> Yes. >>>>>>>>> >>>>>>>>>> which kernel version do you use? >>>>>>>>> 3.9. >>>>>>>>> >>>>>>>>>> what avoids allocating the memory for me is the >>>>>>>>>> following (with whatever side effects it has ;-)) >>>>>>>>> This would also fail to migrate any page that is swapped >>>>>>>>> out, breaking overcommit in a more subtle way. :) >>>>>>>>> >>>>>>>>> Paolo >>>>>>>> the following does also not allocate memory, but qemu >>>>>>>> does... >>>>>>> Hi, Peter As the patch writes >>>>>>> >>>>>>> "not sending zero pages breaks migration if a page is zero >>>>>>> at the source but not at the destination." >>>>>>> >>>>>>> I don't understand why it would be trouble, shouldn't all >>>>>>> page not received in dest be treated as zero pages? >>>>>> >>>>>> >>>>>> How would the destination guest know if some page must be >>>>>> cleared? The previous patch (which Peter reverted) did not >>>>>> send anything for the pages which were zero on the source >>>>>> side. >>>>> If an page was not received and destination knows that page >>>>> should exist according to total size, fill it with zero at >>>>> destination, would it solve the problem? >>>> >>>> It is _live_ migration, the source sends changes, same pages can >>>> change and be sent several times. So we would need to turn >>>> tracking on on the destination to know if some page was received >>>> from the source or changed by the destination itself (by writing >>>> there bios/firmware images, etc) and then clear pages which were >>>> touched by the destination and were not sent by the source. >>> OK, I can understand the problem is, for example: Destination boots >>> up with 0x0000-0xFFFF filled with bios image. Source forgot to send >>> zero pages in 0x0000-0xFFFF. >> >> >> The source did not forget, instead it zeroed these pages during its >> life and thought that they must be zeroed at the destination already >> (as the destination did not start and did not have a chance to write >> something there). >> >> >>> After migration destination got 0x0000-0xFFFF dirty(different with >>> source) >> >> Yep. And those pages were empty on the source what made debugging very >> easy :) >> >> >>> Thanks for explain. >>> >>> This seems refer to the migration protocol: how should the guest >>> treat unsent pages. The patch causing the problem, actually treat >>> zero pages as "not to sent" at source, but another half is missing: >>> treat "not received" as zero pages at destination. I guess if second >>> half is added, problem is gone: after page transfer completed, >>> before destination resume, fill zero in "not received" pages. >> >> >> >> Make a working patch, we'll discuss it :) I do not see much >> acceleration coming from there. > > I would also not spent much time with this. I would either look to find > an easy way to fix the initialization code to not unneccessarily load > data into RAM or i will sent a v2 of my patch following Eric's > concerns. There is no easy way to implement the flag and keep your original patch as we have to implement this flag in all architectures which got broken by your patch and I personally can fix only PPC64-pseries but not the others. Furthermore your revert + new patches perfectly solve the problem, why would we want to bother now with this new flag which nobody really needs right now? Please, please, revert the original patch or I'll try to do it :) -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-10 6:39 ` Alexey Kardashevskiy @ 2013-06-10 6:50 ` Peter Lieven 2013-06-10 6:55 ` Alexey Kardashevskiy 0 siblings, 1 reply; 49+ messages in thread From: Peter Lieven @ 2013-06-10 6:50 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Paolo Bonzini, David Gibson, qemu-ppc, Wenchao Xia, qemu-devel On 10.06.2013 08:39, Alexey Kardashevskiy wrote: > On 06/09/2013 05:27 PM, Peter Lieven wrote: >> Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>: >> >>> On 06/09/2013 01:01 PM, Wenchao Xia wrote: >>>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道: >>>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote: >>>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道: >>>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote: >>>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote: >>>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>>>>>>> You could also scan the page for nonzero >>>>>>>>>>>>>>>>> values before writing it. >>>>>>>>>>>>>>> i had this in mind, but then choosed the other >>>>>>>>>>>>>>> approach.... turned out to be a bad idea. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> alexey: i will prepare a patch later today, >>>>>>>>>>>>>>> could you then please verify it fixes your >>>>>>>>>>>>>>> problem. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> paolo: would we still need the madvise or is >>>>>>>>>>>>>>> it enough to not write the zeroes? >>>>>>>>>>>>>> It should be enough to not write them. >>>>>>>>>>>>> Problem: checking the pages for zero allocates >>>>>>>>>>>>> them. even at the source. >>>>>>>>>>>> It doesn't look like. I tried this program and top >>>>>>>>>>>> doesn't show an increasing amount of reserved >>>>>>>>>>>> memory: >>>>>>>>>>>> >>>>>>>>>>>> #include <stdio.h> #include <stdlib.h> int main() { >>>>>>>>>>>> char *x = malloc(500 << 20); int i, j; for (i = 0; i >>>>>>>>>>>> < 500; i += 10) { for (j = 0; j < 10 << 20; j += >>>>>>>>>>>> 4096) { *(volatile char*) (x + (i << 20) + j); } >>>>>>>>>>>> getchar(); } } >>>>>>>>>>> strange. we are talking about RSS size, right? >>>>>>>>>> None of the three top values change, and only VIRT is >>>>>>>>>>> 500 MB. >>>>>>>>>>> is the malloc above using mmapped memory? >>>>>>>>>> Yes. >>>>>>>>>> >>>>>>>>>>> which kernel version do you use? >>>>>>>>>> 3.9. >>>>>>>>>> >>>>>>>>>>> what avoids allocating the memory for me is the >>>>>>>>>>> following (with whatever side effects it has ;-)) >>>>>>>>>> This would also fail to migrate any page that is swapped >>>>>>>>>> out, breaking overcommit in a more subtle way. :) >>>>>>>>>> >>>>>>>>>> Paolo >>>>>>>>> the following does also not allocate memory, but qemu >>>>>>>>> does... >>>>>>>> Hi, Peter As the patch writes >>>>>>>> >>>>>>>> "not sending zero pages breaks migration if a page is zero >>>>>>>> at the source but not at the destination." >>>>>>>> >>>>>>>> I don't understand why it would be trouble, shouldn't all >>>>>>>> page not received in dest be treated as zero pages? >>>>>>> >>>>>>> How would the destination guest know if some page must be >>>>>>> cleared? The previous patch (which Peter reverted) did not >>>>>>> send anything for the pages which were zero on the source >>>>>>> side. >>>>>> If an page was not received and destination knows that page >>>>>> should exist according to total size, fill it with zero at >>>>>> destination, would it solve the problem? >>>>> It is _live_ migration, the source sends changes, same pages can >>>>> change and be sent several times. So we would need to turn >>>>> tracking on on the destination to know if some page was received >>>>> from the source or changed by the destination itself (by writing >>>>> there bios/firmware images, etc) and then clear pages which were >>>>> touched by the destination and were not sent by the source. >>>> OK, I can understand the problem is, for example: Destination boots >>>> up with 0x0000-0xFFFF filled with bios image. Source forgot to send >>>> zero pages in 0x0000-0xFFFF. >>> >>> The source did not forget, instead it zeroed these pages during its >>> life and thought that they must be zeroed at the destination already >>> (as the destination did not start and did not have a chance to write >>> something there). >>> >>> >>>> After migration destination got 0x0000-0xFFFF dirty(different with >>>> source) >>> Yep. And those pages were empty on the source what made debugging very >>> easy :) >>> >>> >>>> Thanks for explain. >>>> >>>> This seems refer to the migration protocol: how should the guest >>>> treat unsent pages. The patch causing the problem, actually treat >>>> zero pages as "not to sent" at source, but another half is missing: >>>> treat "not received" as zero pages at destination. I guess if second >>>> half is added, problem is gone: after page transfer completed, >>>> before destination resume, fill zero in "not received" pages. >>> >>> >>> Make a working patch, we'll discuss it :) I do not see much >>> acceleration coming from there. >> I would also not spent much time with this. I would either look to find >> an easy way to fix the initialization code to not unneccessarily load >> data into RAM or i will sent a v2 of my patch following Eric's >> concerns. > There is no easy way to implement the flag and keep your original patch as > we have to implement this flag in all architectures which got broken by > your patch and I personally can fix only PPC64-pseries but not the others. > > Furthermore your revert + new patches perfectly solve the problem, why > would we want to bother now with this new flag which nobody really needs > right now? > > Please, please, revert the original patch or I'll try to do it :) > > I tried, but there where concerns by the community. Alternativly I found the following alternate solution. Please drop the 2 patches and try the following: diff --git a/arch_init.c b/arch_init.c index 5d32ecf..458bf8c 100644 --- a/arch_init.c +++ b/arch_init.c @@ -799,6 +799,8 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id) while (total_ram_bytes) { RAMBlock *block; uint8_t len; + void *base; + ram_addr_t offset; len = qemu_get_byte(f); qemu_get_buffer(f, (uint8_t *)id, len); @@ -822,6 +824,14 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id) goto done; } + base = memory_region_get_ram_ptr(block->mr); + for (offset = 0; offset < block->length; + offset += TARGET_PAGE_SIZE) { + if (!is_zero_page(base + offset)) { + memset(base + offset, 0x00, TARGET_PAGE_SIZE); + } + } + total_ram_bytes -= length; } } This is done at setup time so there is no additional cost for zero checking at each compressed page coming in. Peter ^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-10 6:50 ` Peter Lieven @ 2013-06-10 6:55 ` Alexey Kardashevskiy 2013-06-10 8:44 ` Peter Lieven 0 siblings, 1 reply; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-06-10 6:55 UTC (permalink / raw) To: Peter Lieven Cc: Paolo Bonzini, David Gibson, qemu-ppc, Wenchao Xia, qemu-devel On 06/10/2013 04:50 PM, Peter Lieven wrote: > On 10.06.2013 08:39, Alexey Kardashevskiy wrote: >> On 06/09/2013 05:27 PM, Peter Lieven wrote: >>> Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>: >>> >>>> On 06/09/2013 01:01 PM, Wenchao Xia wrote: >>>>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道: >>>>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote: >>>>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道: >>>>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote: >>>>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote: >>>>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>>>>>>>> You could also scan the page for nonzero >>>>>>>>>>>>>>>>>> values before writing it. >>>>>>>>>>>>>>>> i had this in mind, but then choosed the other >>>>>>>>>>>>>>>> approach.... turned out to be a bad idea. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> alexey: i will prepare a patch later today, >>>>>>>>>>>>>>>> could you then please verify it fixes your >>>>>>>>>>>>>>>> problem. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> paolo: would we still need the madvise or is >>>>>>>>>>>>>>>> it enough to not write the zeroes? >>>>>>>>>>>>>>> It should be enough to not write them. >>>>>>>>>>>>>> Problem: checking the pages for zero allocates >>>>>>>>>>>>>> them. even at the source. >>>>>>>>>>>>> It doesn't look like. I tried this program and top >>>>>>>>>>>>> doesn't show an increasing amount of reserved >>>>>>>>>>>>> memory: >>>>>>>>>>>>> >>>>>>>>>>>>> #include <stdio.h> #include <stdlib.h> int main() { >>>>>>>>>>>>> char *x = malloc(500 << 20); int i, j; for (i = 0; i >>>>>>>>>>>>> < 500; i += 10) { for (j = 0; j < 10 << 20; j += >>>>>>>>>>>>> 4096) { *(volatile char*) (x + (i << 20) + j); } >>>>>>>>>>>>> getchar(); } } >>>>>>>>>>>> strange. we are talking about RSS size, right? >>>>>>>>>>> None of the three top values change, and only VIRT is >>>>>>>>>>>> 500 MB. >>>>>>>>>>>> is the malloc above using mmapped memory? >>>>>>>>>>> Yes. >>>>>>>>>>> >>>>>>>>>>>> which kernel version do you use? >>>>>>>>>>> 3.9. >>>>>>>>>>> >>>>>>>>>>>> what avoids allocating the memory for me is the >>>>>>>>>>>> following (with whatever side effects it has ;-)) >>>>>>>>>>> This would also fail to migrate any page that is swapped >>>>>>>>>>> out, breaking overcommit in a more subtle way. :) >>>>>>>>>>> >>>>>>>>>>> Paolo >>>>>>>>>> the following does also not allocate memory, but qemu >>>>>>>>>> does... >>>>>>>>> Hi, Peter As the patch writes >>>>>>>>> >>>>>>>>> "not sending zero pages breaks migration if a page is zero >>>>>>>>> at the source but not at the destination." >>>>>>>>> >>>>>>>>> I don't understand why it would be trouble, shouldn't all >>>>>>>>> page not received in dest be treated as zero pages? >>>>>>>> >>>>>>>> How would the destination guest know if some page must be >>>>>>>> cleared? The previous patch (which Peter reverted) did not >>>>>>>> send anything for the pages which were zero on the source >>>>>>>> side. >>>>>>> If an page was not received and destination knows that page >>>>>>> should exist according to total size, fill it with zero at >>>>>>> destination, would it solve the problem? >>>>>> It is _live_ migration, the source sends changes, same pages can >>>>>> change and be sent several times. So we would need to turn >>>>>> tracking on on the destination to know if some page was received >>>>>> from the source or changed by the destination itself (by writing >>>>>> there bios/firmware images, etc) and then clear pages which were >>>>>> touched by the destination and were not sent by the source. >>>>> OK, I can understand the problem is, for example: Destination boots >>>>> up with 0x0000-0xFFFF filled with bios image. Source forgot to send >>>>> zero pages in 0x0000-0xFFFF. >>>> >>>> The source did not forget, instead it zeroed these pages during its >>>> life and thought that they must be zeroed at the destination already >>>> (as the destination did not start and did not have a chance to write >>>> something there). >>>> >>>> >>>>> After migration destination got 0x0000-0xFFFF dirty(different with >>>>> source) >>>> Yep. And those pages were empty on the source what made debugging very >>>> easy :) >>>> >>>> >>>>> Thanks for explain. >>>>> >>>>> This seems refer to the migration protocol: how should the guest >>>>> treat unsent pages. The patch causing the problem, actually treat >>>>> zero pages as "not to sent" at source, but another half is missing: >>>>> treat "not received" as zero pages at destination. I guess if second >>>>> half is added, problem is gone: after page transfer completed, >>>>> before destination resume, fill zero in "not received" pages. >>>> >>>> >>>> Make a working patch, we'll discuss it :) I do not see much >>>> acceleration coming from there. >>> I would also not spent much time with this. I would either look to find >>> an easy way to fix the initialization code to not unneccessarily load >>> data into RAM or i will sent a v2 of my patch following Eric's >>> concerns. >> There is no easy way to implement the flag and keep your original patch as >> we have to implement this flag in all architectures which got broken by >> your patch and I personally can fix only PPC64-pseries but not the others. >> >> Furthermore your revert + new patches perfectly solve the problem, why >> would we want to bother now with this new flag which nobody really needs >> right now? >> >> Please, please, revert the original patch or I'll try to do it :) >> >> > I tried, but there where concerns by the community. Was here anybody who did not want to revert the patch (besides you)? I did not notice. > Alternativly I found > the following alternate solution. Please drop the 2 patches and try the > following: How is it going to work if upstream QEMU doesn't send anything about empty pages at all (this is why I want to revert that patch)? > > diff --git a/arch_init.c b/arch_init.c > index 5d32ecf..458bf8c 100644 > --- a/arch_init.c > +++ b/arch_init.c > @@ -799,6 +799,8 @@ static int ram_load(QEMUFile *f, void *opaque, int > version_id) > while (total_ram_bytes) { > RAMBlock *block; > uint8_t len; > + void *base; > + ram_addr_t offset; > > len = qemu_get_byte(f); > qemu_get_buffer(f, (uint8_t *)id, len); > @@ -822,6 +824,14 @@ static int ram_load(QEMUFile *f, void *opaque, int > version_id) > goto done; > } > > + base = memory_region_get_ram_ptr(block->mr); > + for (offset = 0; offset < block->length; > + offset += TARGET_PAGE_SIZE) { > + if (!is_zero_page(base + offset)) { > + memset(base + offset, 0x00, TARGET_PAGE_SIZE); > + } > + } > + > total_ram_bytes -= length; > } > } > > This is done at setup time so there is no additional cost for zero checking > at each compressed page > coming in. > > Peter -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-10 6:55 ` Alexey Kardashevskiy @ 2013-06-10 8:44 ` Peter Lieven 2013-06-10 9:10 ` Alexey Kardashevskiy 0 siblings, 1 reply; 49+ messages in thread From: Peter Lieven @ 2013-06-10 8:44 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Paolo Bonzini, David Gibson, qemu-ppc, Wenchao Xia, qemu-devel On 10.06.2013 08:55, Alexey Kardashevskiy wrote: > On 06/10/2013 04:50 PM, Peter Lieven wrote: >> On 10.06.2013 08:39, Alexey Kardashevskiy wrote: >>> On 06/09/2013 05:27 PM, Peter Lieven wrote: >>>> Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>: >>>> >>>>> On 06/09/2013 01:01 PM, Wenchao Xia wrote: >>>>>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道: >>>>>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote: >>>>>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道: >>>>>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote: >>>>>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote: >>>>>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>>>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>>>>>>>>> You could also scan the page for nonzero >>>>>>>>>>>>>>>>>>> values before writing it. >>>>>>>>>>>>>>>>> i had this in mind, but then choosed the other >>>>>>>>>>>>>>>>> approach.... turned out to be a bad idea. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> alexey: i will prepare a patch later today, >>>>>>>>>>>>>>>>> could you then please verify it fixes your >>>>>>>>>>>>>>>>> problem. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> paolo: would we still need the madvise or is >>>>>>>>>>>>>>>>> it enough to not write the zeroes? >>>>>>>>>>>>>>>> It should be enough to not write them. >>>>>>>>>>>>>>> Problem: checking the pages for zero allocates >>>>>>>>>>>>>>> them. even at the source. >>>>>>>>>>>>>> It doesn't look like. I tried this program and top >>>>>>>>>>>>>> doesn't show an increasing amount of reserved >>>>>>>>>>>>>> memory: >>>>>>>>>>>>>> >>>>>>>>>>>>>> #include <stdio.h> #include <stdlib.h> int main() { >>>>>>>>>>>>>> char *x = malloc(500 << 20); int i, j; for (i = 0; i >>>>>>>>>>>>>> < 500; i += 10) { for (j = 0; j < 10 << 20; j += >>>>>>>>>>>>>> 4096) { *(volatile char*) (x + (i << 20) + j); } >>>>>>>>>>>>>> getchar(); } } >>>>>>>>>>>>> strange. we are talking about RSS size, right? >>>>>>>>>>>> None of the three top values change, and only VIRT is >>>>>>>>>>>>> 500 MB. >>>>>>>>>>>>> is the malloc above using mmapped memory? >>>>>>>>>>>> Yes. >>>>>>>>>>>> >>>>>>>>>>>>> which kernel version do you use? >>>>>>>>>>>> 3.9. >>>>>>>>>>>> >>>>>>>>>>>>> what avoids allocating the memory for me is the >>>>>>>>>>>>> following (with whatever side effects it has ;-)) >>>>>>>>>>>> This would also fail to migrate any page that is swapped >>>>>>>>>>>> out, breaking overcommit in a more subtle way. :) >>>>>>>>>>>> >>>>>>>>>>>> Paolo >>>>>>>>>>> the following does also not allocate memory, but qemu >>>>>>>>>>> does... >>>>>>>>>> Hi, Peter As the patch writes >>>>>>>>>> >>>>>>>>>> "not sending zero pages breaks migration if a page is zero >>>>>>>>>> at the source but not at the destination." >>>>>>>>>> >>>>>>>>>> I don't understand why it would be trouble, shouldn't all >>>>>>>>>> page not received in dest be treated as zero pages? >>>>>>>>> How would the destination guest know if some page must be >>>>>>>>> cleared? The previous patch (which Peter reverted) did not >>>>>>>>> send anything for the pages which were zero on the source >>>>>>>>> side. >>>>>>>> If an page was not received and destination knows that page >>>>>>>> should exist according to total size, fill it with zero at >>>>>>>> destination, would it solve the problem? >>>>>>> It is _live_ migration, the source sends changes, same pages can >>>>>>> change and be sent several times. So we would need to turn >>>>>>> tracking on on the destination to know if some page was received >>>>>>> from the source or changed by the destination itself (by writing >>>>>>> there bios/firmware images, etc) and then clear pages which were >>>>>>> touched by the destination and were not sent by the source. >>>>>> OK, I can understand the problem is, for example: Destination boots >>>>>> up with 0x0000-0xFFFF filled with bios image. Source forgot to send >>>>>> zero pages in 0x0000-0xFFFF. >>>>> The source did not forget, instead it zeroed these pages during its >>>>> life and thought that they must be zeroed at the destination already >>>>> (as the destination did not start and did not have a chance to write >>>>> something there). >>>>> >>>>> >>>>>> After migration destination got 0x0000-0xFFFF dirty(different with >>>>>> source) >>>>> Yep. And those pages were empty on the source what made debugging very >>>>> easy :) >>>>> >>>>> >>>>>> Thanks for explain. >>>>>> >>>>>> This seems refer to the migration protocol: how should the guest >>>>>> treat unsent pages. The patch causing the problem, actually treat >>>>>> zero pages as "not to sent" at source, but another half is missing: >>>>>> treat "not received" as zero pages at destination. I guess if second >>>>>> half is added, problem is gone: after page transfer completed, >>>>>> before destination resume, fill zero in "not received" pages. >>>>> >>>>> Make a working patch, we'll discuss it :) I do not see much >>>>> acceleration coming from there. >>>> I would also not spent much time with this. I would either look to find >>>> an easy way to fix the initialization code to not unneccessarily load >>>> data into RAM or i will sent a v2 of my patch following Eric's >>>> concerns. >>> There is no easy way to implement the flag and keep your original patch as >>> we have to implement this flag in all architectures which got broken by >>> your patch and I personally can fix only PPC64-pseries but not the others. >>> >>> Furthermore your revert + new patches perfectly solve the problem, why >>> would we want to bother now with this new flag which nobody really needs >>> right now? >>> >>> Please, please, revert the original patch or I'll try to do it :) >>> >>> >> I tried, but there where concerns by the community. > > Was here anybody who did not want to revert the patch (besides you)? > I did not notice. Eric said I should not drop the skipped_pages stuff in the monitor. > > >> Alternativly I found >> the following alternate solution. Please drop the 2 patches and try the >> following: > > How is it going to work if upstream QEMU doesn't send anything about empty > pages at all (this is why I want to revert that patch)? I do not understand your question. The patch below zeroes out the destination memory if it is not zero (e.g. if there is a BIOS copied to memory already during machine init). I would prefer not to completely drop the patch since it saves bandwidth and resources. Peter ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-10 8:44 ` Peter Lieven @ 2013-06-10 9:10 ` Alexey Kardashevskiy 2013-06-10 9:33 ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt 0 siblings, 1 reply; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-06-10 9:10 UTC (permalink / raw) To: Peter Lieven Cc: Paolo Bonzini, David Gibson, qemu-ppc, Wenchao Xia, qemu-devel On 06/10/2013 06:44 PM, Peter Lieven wrote: > On 10.06.2013 08:55, Alexey Kardashevskiy wrote: >> On 06/10/2013 04:50 PM, Peter Lieven wrote: >>> On 10.06.2013 08:39, Alexey Kardashevskiy wrote: >>>> On 06/09/2013 05:27 PM, Peter Lieven wrote: >>>>> Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>: >>>>> >>>>>> On 06/09/2013 01:01 PM, Wenchao Xia wrote: >>>>>>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道: >>>>>>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote: >>>>>>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道: >>>>>>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote: >>>>>>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote: >>>>>>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote: >>>>>>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto: >>>>>>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote: >>>>>>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto: >>>>>>>>>>>>>>>>>>>> You could also scan the page for nonzero >>>>>>>>>>>>>>>>>>>> values before writing it. >>>>>>>>>>>>>>>>>> i had this in mind, but then choosed the other >>>>>>>>>>>>>>>>>> approach.... turned out to be a bad idea. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> alexey: i will prepare a patch later today, >>>>>>>>>>>>>>>>>> could you then please verify it fixes your >>>>>>>>>>>>>>>>>> problem. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> paolo: would we still need the madvise or is >>>>>>>>>>>>>>>>>> it enough to not write the zeroes? >>>>>>>>>>>>>>>>> It should be enough to not write them. >>>>>>>>>>>>>>>> Problem: checking the pages for zero allocates >>>>>>>>>>>>>>>> them. even at the source. >>>>>>>>>>>>>>> It doesn't look like. I tried this program and top >>>>>>>>>>>>>>> doesn't show an increasing amount of reserved >>>>>>>>>>>>>>> memory: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> #include <stdio.h> #include <stdlib.h> int main() { >>>>>>>>>>>>>>> char *x = malloc(500 << 20); int i, j; for (i = 0; i >>>>>>>>>>>>>>> < 500; i += 10) { for (j = 0; j < 10 << 20; j += >>>>>>>>>>>>>>> 4096) { *(volatile char*) (x + (i << 20) + j); } >>>>>>>>>>>>>>> getchar(); } } >>>>>>>>>>>>>> strange. we are talking about RSS size, right? >>>>>>>>>>>>> None of the three top values change, and only VIRT is >>>>>>>>>>>>>> 500 MB. >>>>>>>>>>>>>> is the malloc above using mmapped memory? >>>>>>>>>>>>> Yes. >>>>>>>>>>>>> >>>>>>>>>>>>>> which kernel version do you use? >>>>>>>>>>>>> 3.9. >>>>>>>>>>>>> >>>>>>>>>>>>>> what avoids allocating the memory for me is the >>>>>>>>>>>>>> following (with whatever side effects it has ;-)) >>>>>>>>>>>>> This would also fail to migrate any page that is swapped >>>>>>>>>>>>> out, breaking overcommit in a more subtle way. :) >>>>>>>>>>>>> >>>>>>>>>>>>> Paolo >>>>>>>>>>>> the following does also not allocate memory, but qemu >>>>>>>>>>>> does... >>>>>>>>>>> Hi, Peter As the patch writes >>>>>>>>>>> >>>>>>>>>>> "not sending zero pages breaks migration if a page is zero >>>>>>>>>>> at the source but not at the destination." >>>>>>>>>>> >>>>>>>>>>> I don't understand why it would be trouble, shouldn't all >>>>>>>>>>> page not received in dest be treated as zero pages? >>>>>>>>>> How would the destination guest know if some page must be >>>>>>>>>> cleared? The previous patch (which Peter reverted) did not >>>>>>>>>> send anything for the pages which were zero on the source >>>>>>>>>> side. >>>>>>>>> If an page was not received and destination knows that page >>>>>>>>> should exist according to total size, fill it with zero at >>>>>>>>> destination, would it solve the problem? >>>>>>>> It is _live_ migration, the source sends changes, same pages can >>>>>>>> change and be sent several times. So we would need to turn >>>>>>>> tracking on on the destination to know if some page was received >>>>>>>> from the source or changed by the destination itself (by writing >>>>>>>> there bios/firmware images, etc) and then clear pages which were >>>>>>>> touched by the destination and were not sent by the source. >>>>>>> OK, I can understand the problem is, for example: Destination boots >>>>>>> up with 0x0000-0xFFFF filled with bios image. Source forgot to send >>>>>>> zero pages in 0x0000-0xFFFF. >>>>>> The source did not forget, instead it zeroed these pages during its >>>>>> life and thought that they must be zeroed at the destination already >>>>>> (as the destination did not start and did not have a chance to write >>>>>> something there). >>>>>> >>>>>> >>>>>>> After migration destination got 0x0000-0xFFFF dirty(different with >>>>>>> source) >>>>>> Yep. And those pages were empty on the source what made debugging very >>>>>> easy :) >>>>>> >>>>>> >>>>>>> Thanks for explain. >>>>>>> >>>>>>> This seems refer to the migration protocol: how should the guest >>>>>>> treat unsent pages. The patch causing the problem, actually treat >>>>>>> zero pages as "not to sent" at source, but another half is missing: >>>>>>> treat "not received" as zero pages at destination. I guess if second >>>>>>> half is added, problem is gone: after page transfer completed, >>>>>>> before destination resume, fill zero in "not received" pages. >>>>>> >>>>>> Make a working patch, we'll discuss it :) I do not see much >>>>>> acceleration coming from there. >>>>> I would also not spent much time with this. I would either look to find >>>>> an easy way to fix the initialization code to not unneccessarily load >>>>> data into RAM or i will sent a v2 of my patch following Eric's >>>>> concerns. >>>> There is no easy way to implement the flag and keep your original patch as >>>> we have to implement this flag in all architectures which got broken by >>>> your patch and I personally can fix only PPC64-pseries but not the others. >>>> >>>> Furthermore your revert + new patches perfectly solve the problem, why >>>> would we want to bother now with this new flag which nobody really needs >>>> right now? >>>> >>>> Please, please, revert the original patch or I'll try to do it :) >>>> >>>> >>> I tried, but there where concerns by the community. >> >> Was here anybody who did not want to revert the patch (besides you)? >> I did not notice. > Eric said I should not drop the skipped_pages stuff in the monitor. >> >> >>> Alternativly I found >>> the following alternate solution. Please drop the 2 patches and try the >>> following: >> >> How is it going to work if upstream QEMU doesn't send anything about empty >> pages at all (this is why I want to revert that patch)? > I do not understand your question. The patch below zeroes out the destination > memory if it is not zero (e.g. if there is a BIOS copied to memory already > during > machine init). > > I would prefer not to completely drop the patch since it saves bandwidth and > resources. I would like migration to do what it should do - send pages no matter what, this is exactly what migration is for. If there any many, many empty pages (which I doubt to be a very often real life case), they could all merged in big consecutive chunks and sent at the end of migration. -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [Qemu-ppc] broken incoming migration 2013-06-10 9:10 ` Alexey Kardashevskiy @ 2013-06-10 9:33 ` Benjamin Herrenschmidt 2013-06-10 9:42 ` Peter Lieven 0 siblings, 1 reply; 49+ messages in thread From: Benjamin Herrenschmidt @ 2013-06-10 9:33 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Peter Lieven, qemu-devel, qemu-ppc, Paolo Bonzini, Wenchao Xia, David Gibson On Mon, 2013-06-10 at 19:10 +1000, Alexey Kardashevskiy wrote: > > I would prefer not to completely drop the patch since it saves bandwidth and > > resources. > > I would like migration to do what it should do - send pages no matter what, > this is exactly what migration is for. If there any many, many empty pages > (which I doubt to be a very often real life case), they could all merged in > big consecutive chunks and sent at the end of migration. I tend to agree. The problem of sending empty pages is purely a problem of compression. If the current mechanism is deemed "not efficient enough" for in the case of having lots of zero-pages, then by all means invent a better packet format for more tightly representing them on the wire, but don't break things by not sending them at all. Cheers, Ben. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [Qemu-ppc] broken incoming migration 2013-06-10 9:33 ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt @ 2013-06-10 9:42 ` Peter Lieven 0 siblings, 0 replies; 49+ messages in thread From: Peter Lieven @ 2013-06-10 9:42 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, Paolo Bonzini, Wenchao Xia, David Gibson On 10.06.2013 11:33, Benjamin Herrenschmidt wrote: > On Mon, 2013-06-10 at 19:10 +1000, Alexey Kardashevskiy wrote: >>> I would prefer not to completely drop the patch since it saves bandwidth and >>> resources. >> I would like migration to do what it should do - send pages no matter what, >> this is exactly what migration is for. If there any many, many empty pages >> (which I doubt to be a very often real life case), they could all merged in >> big consecutive chunks and sent at the end of migration. > I tend to agree. The problem of sending empty pages is purely a problem of > compression. If the current mechanism is deemed "not efficient enough" for > in the case of having lots of zero-pages, then by all means invent a better > packet format for more tightly representing them on the wire, but don't > break things by not sending them at all. Ok, I see the point. I think the paradigm to say that the destination should "decide" if it needs a page or not is a sound one. Zero pages are quite often depending on the lifetime and the operating system used. But a consecutive range of zero pages is only likely in the bulk stage. I don't know if its reasonable to add a special encoding for that. I will sent a v2 of my previous revert patch addressing Erics concerns shortly. Peter ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [Qemu-ppc] broken incoming migration 2013-06-09 2:16 ` Wenchao Xia 2013-06-09 2:34 ` Alexey Kardashevskiy @ 2013-06-09 2:53 ` Benjamin Herrenschmidt 2013-06-12 14:00 ` Paolo Bonzini 1 sibling, 1 reply; 49+ messages in thread From: Benjamin Herrenschmidt @ 2013-06-09 2:53 UTC (permalink / raw) To: Wenchao Xia Cc: Alexey Kardashevskiy, Peter Lieven, qemu-devel, qemu-ppc, Paolo Bonzini, David Gibson On Sun, 2013-06-09 at 10:16 +0800, Wenchao Xia wrote: > If an page was not received and destination knows that page should > exist according to total size, fill it with zero at destination, would > it solve the problem? The easiest way to do that is to not write to those pages at the destination to begin with, when initializing the VM... Is there any way to know that a VM is being setup as a migration target or not ? Ben. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [Qemu-ppc] broken incoming migration 2013-06-09 2:53 ` Benjamin Herrenschmidt @ 2013-06-12 14:00 ` Paolo Bonzini 2013-06-12 14:11 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 49+ messages in thread From: Paolo Bonzini @ 2013-06-12 14:00 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Alexey Kardashevskiy, Peter Lieven, qemu-devel, qemu-ppc, Wenchao Xia, David Gibson Il 08/06/2013 22:53, Benjamin Herrenschmidt ha scritto: > On Sun, 2013-06-09 at 10:16 +0800, Wenchao Xia wrote: >> If an page was not received and destination knows that page should >> exist according to total size, fill it with zero at destination, would >> it solve the problem? > > The easiest way to do that is to not write to those pages at the > destination to begin with, when initializing the VM... Is there any way > to know that a VM is being setup as a migration target or not ? There is the "incoming" variable in vl.c (currently not a global), but I suspect Peter's patch could have also broken loadvm. It could quickly become a rat hole. The only bug we have is not a performance bug related to compression; it's that writing zero pages breaks overcommit. Let's fix that, and only that. Paolo ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [Qemu-ppc] broken incoming migration 2013-06-12 14:00 ` Paolo Bonzini @ 2013-06-12 14:11 ` Benjamin Herrenschmidt 2013-06-12 20:10 ` Paolo Bonzini 0 siblings, 1 reply; 49+ messages in thread From: Benjamin Herrenschmidt @ 2013-06-12 14:11 UTC (permalink / raw) To: Paolo Bonzini Cc: Alexey Kardashevskiy, Peter Lieven, qemu-devel, qemu-ppc, Wenchao Xia, David Gibson On Wed, 2013-06-12 at 10:00 -0400, Paolo Bonzini wrote: > The only bug we have is not a performance bug related to compression; > it's that writing zero pages breaks overcommit. Let's fix that, and > only that. Right, do we have a way to madvise "throw away" these instead ? Or do we have a way to track that the platform init code did write something there and only clear *those* pages ? Cheers, Ben. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [Qemu-ppc] broken incoming migration 2013-06-12 14:11 ` Benjamin Herrenschmidt @ 2013-06-12 20:10 ` Paolo Bonzini 2013-06-13 2:41 ` Wenchao Xia 0 siblings, 1 reply; 49+ messages in thread From: Paolo Bonzini @ 2013-06-12 20:10 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Alexey Kardashevskiy, Peter Lieven, qemu-devel, qemu-ppc, Wenchao Xia, David Gibson Il 12/06/2013 10:11, Benjamin Herrenschmidt ha scritto: > On Wed, 2013-06-12 at 10:00 -0400, Paolo Bonzini wrote: >> The only bug we have is not a performance bug related to compression; >> it's that writing zero pages breaks overcommit. Let's fix that, and >> only that. > > Right, do we have a way to madvise "throw away" these instead ? We already do that, but apparently that madvise is asynchronous. > Or do we > have a way to track that the platform init code did write something > there and only clear *those* pages ? No need for; since it's copy-on-write, not copy-on-read :) we can just check for pages that are zero and not rewrite them with zeros. That's what Peter's patches do, I'll review them right away. Paolo ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [Qemu-ppc] broken incoming migration 2013-06-12 20:10 ` Paolo Bonzini @ 2013-06-13 2:41 ` Wenchao Xia 0 siblings, 0 replies; 49+ messages in thread From: Wenchao Xia @ 2013-06-13 2:41 UTC (permalink / raw) To: Paolo Bonzini Cc: Alexey Kardashevskiy, Peter Lieven, qemu-devel, qemu-ppc, David Gibson 于 2013-6-13 4:10, Paolo Bonzini 写道: > Il 12/06/2013 10:11, Benjamin Herrenschmidt ha scritto: >> On Wed, 2013-06-12 at 10:00 -0400, Paolo Bonzini wrote: >>> The only bug we have is not a performance bug related to compression; >>> it's that writing zero pages breaks overcommit. Let's fix that, and >>> only that. >> >> Right, do we have a way to madvise "throw away" these instead ? > > We already do that, but apparently that madvise is asynchronous. > >> Or do we >> have a way to track that the platform init code did write something >> there and only clear *those* pages ? > > No need for; since it's copy-on-write, not copy-on-read :) we can just > check for pages that are zero and not rewrite them with zeros. That's I think it is the right way to improve overcommit without breaking anything. > what Peter's patches do, I'll review them right away. > > Paolo > > -- Best Regards Wenchao Xia ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 14:38 ` Peter Lieven 2013-05-30 14:41 ` Paolo Bonzini @ 2013-06-03 10:04 ` Alexey Kardashevskiy 2013-06-04 10:56 ` Peter Lieven 1 sibling, 1 reply; 49+ messages in thread From: Alexey Kardashevskiy @ 2013-06-03 10:04 UTC (permalink / raw) To: Peter Lieven; +Cc: Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson On 05/31/2013 12:38 AM, Peter Lieven wrote: > > > > Am 30.05.2013 um 15:41 schrieb "Paolo Bonzini" <pbonzini@redhat.com>: > >> Il 30/05/2013 11:08, Peter Lieven ha scritto: >>> Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy: >>>> On 05/30/2013 05:49 PM, Paolo Bonzini wrote: >>>>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto: >>>>>> Hi! >>>>>> >>>>>> I found the migration broken on pseries platform, specifically, this patch >>>>>> broke it: >>>>>> >>>>>> f1c72795af573b24a7da5eb52375c9aba8a37972 >>>>>> migration: do not sent zero pages in bulk stage >>>>>> >>>>>> The idea is not to send zero pages to the destination guest which is >>>>>> expected to have 100% empty RAM. >>>>>> >>>>>> However on pseries plaftorm the guest always has some stuff in the RAM as a >>>>>> part of initialization (device tree, system firmware and rtas (?)) so it is >>>>>> not completely empty. As the source guest cannot detect this, it skips some >>>>>> pages during migration and we get a broken destination guest. Bug. >>>>>> >>>>>> While the idea is ok in general, I do not see any easy way to fix it as >>>>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information >>>>>> about whether we are about to receive a migration or not (-incoming >>>>>> parameter) and we cannot move device-tree and system firmware >>>>>> initialization anywhere else. >>>>>> >>>>>> ram_bulk_stage is static and cannot be disabled from the platform >>>>>> initialization code. >>>>>> >>>>>> So what would the community suggest? >>>>> Revert the patch. :) >>>> I'll wait for 24 hours (forgot to cc: the author) and then post a revert >>>> patch :) >>> does this problem only occur on pseries emulation? >> >> Probably not. On a PC, it would occur if you had 4K of zeros in the >> source BIOS but not in the destination BIOS. When you reboot, the BIOS >> image is wrong. >> >>> not sending zero pages is not only a performance benefit it also makes >>> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously >>> and memory is not available immediately. >> >> You could also scan the page for nonzero values before writing it. > > i had this in mind, but then choosed the other approach.... turned out to be a bad idea. > > alexey: i will prepare a patch later today, could you then please verify it fixes your problem. Yes I can, where is the patch? :) > paolo: would we still need the madvise or is it enough to not write the zeroes? > > Peter > >> >> Paolo > -- Alexey ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-06-03 10:04 ` [Qemu-devel] " Alexey Kardashevskiy @ 2013-06-04 10:56 ` Peter Lieven 0 siblings, 0 replies; 49+ messages in thread From: Peter Lieven @ 2013-06-04 10:56 UTC (permalink / raw) To: Alexey Kardashevskiy; +Cc: Paolo Bonzini, qemu-ppc, qemu-devel, David Gibson On 03.06.2013 12:04, Alexey Kardashevskiy wrote: > On 05/31/2013 12:38 AM, Peter Lieven wrote: >> >> >> Am 30.05.2013 um 15:41 schrieb "Paolo Bonzini" <pbonzini@redhat.com>: >> >>> Il 30/05/2013 11:08, Peter Lieven ha scritto: >>>> Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy: >>>>> On 05/30/2013 05:49 PM, Paolo Bonzini wrote: >>>>>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto: >>>>>>> Hi! >>>>>>> >>>>>>> I found the migration broken on pseries platform, specifically, this patch >>>>>>> broke it: >>>>>>> >>>>>>> f1c72795af573b24a7da5eb52375c9aba8a37972 >>>>>>> migration: do not sent zero pages in bulk stage >>>>>>> >>>>>>> The idea is not to send zero pages to the destination guest which is >>>>>>> expected to have 100% empty RAM. >>>>>>> >>>>>>> However on pseries plaftorm the guest always has some stuff in the RAM as a >>>>>>> part of initialization (device tree, system firmware and rtas (?)) so it is >>>>>>> not completely empty. As the source guest cannot detect this, it skips some >>>>>>> pages during migration and we get a broken destination guest. Bug. >>>>>>> >>>>>>> While the idea is ok in general, I do not see any easy way to fix it as >>>>>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information >>>>>>> about whether we are about to receive a migration or not (-incoming >>>>>>> parameter) and we cannot move device-tree and system firmware >>>>>>> initialization anywhere else. >>>>>>> >>>>>>> ram_bulk_stage is static and cannot be disabled from the platform >>>>>>> initialization code. >>>>>>> >>>>>>> So what would the community suggest? >>>>>> Revert the patch. :) >>>>> I'll wait for 24 hours (forgot to cc: the author) and then post a revert >>>>> patch :) >>>> does this problem only occur on pseries emulation? >>> Probably not. On a PC, it would occur if you had 4K of zeros in the >>> source BIOS but not in the destination BIOS. When you reboot, the BIOS >>> image is wrong. >>> >>>> not sending zero pages is not only a performance benefit it also makes >>>> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously >>>> and memory is not available immediately. >>> You could also scan the page for nonzero values before writing it. >> i had this in mind, but then choosed the other approach.... turned out to be a bad idea. >> >> alexey: i will prepare a patch later today, could you then please verify it fixes your problem. > > Yes I can, where is the patch? :) its on my todo for today. sorry, have been a bit busy lately. Peter ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 13:00 ` Paolo Bonzini 2013-05-30 13:38 ` Alexey Kardashevskiy 2013-05-30 14:38 ` Peter Lieven @ 2013-06-08 8:24 ` Wenchao Xia 2 siblings, 0 replies; 49+ messages in thread From: Wenchao Xia @ 2013-06-08 8:24 UTC (permalink / raw) To: Paolo Bonzini Cc: Alexey Kardashevskiy, qemu-ppc, Peter Lieven, qemu-devel, David Gibson 于 2013-5-30 21:00, Paolo Bonzini 写道: > Il 30/05/2013 11:08, Peter Lieven ha scritto: >> Am 30.05.2013 10:18, schrieb Alexey Kardashevskiy: >>> On 05/30/2013 05:49 PM, Paolo Bonzini wrote: >>>> Il 30/05/2013 09:44, Alexey Kardashevskiy ha scritto: >>>>> Hi! >>>>> >>>>> I found the migration broken on pseries platform, specifically, this patch >>>>> broke it: >>>>> >>>>> f1c72795af573b24a7da5eb52375c9aba8a37972 >>>>> migration: do not sent zero pages in bulk stage >>>>> >>>>> The idea is not to send zero pages to the destination guest which is >>>>> expected to have 100% empty RAM. >>>>> >>>>> However on pseries plaftorm the guest always has some stuff in the RAM as a >>>>> part of initialization (device tree, system firmware and rtas (?)) so it is >>>>> not completely empty. As the source guest cannot detect this, it skips some >>>>> pages during migration and we get a broken destination guest. Bug. >>>>> >>>>> While the idea is ok in general, I do not see any easy way to fix it as >>>>> neither QEMUMachine::init nor QEMUMachine::reset callbacks has information >>>>> about whether we are about to receive a migration or not (-incoming >>>>> parameter) and we cannot move device-tree and system firmware >>>>> initialization anywhere else. >>>>> >>>>> ram_bulk_stage is static and cannot be disabled from the platform >>>>> initialization code. >>>>> >>>>> So what would the community suggest? >>>> Revert the patch. :) >>> I'll wait for 24 hours (forgot to cc: the author) and then post a revert >>> patch :) >>> >> does this problem only occur on pseries emulation? > > Probably not. On a PC, it would occur if you had 4K of zeros in the > source BIOS but not in the destination BIOS. When you reboot, the BIOS > image is wrong. > So one way is to migrate BIOS contents together? >> not sending zero pages is not only a performance benefit it also makes >> overcomitted memory usable. the madv_dontneed seems to kick in asynchronously >> and memory is not available immediately. > > You could also scan the page for nonzero values before writing it. > > Paolo > -- Best Regards Wenchao Xia ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] broken incoming migration 2013-05-30 7:44 [Qemu-devel] broken incoming migration Alexey Kardashevskiy 2013-05-30 7:49 ` Alexey Kardashevskiy 2013-05-30 7:49 ` Paolo Bonzini @ 2013-05-30 10:18 ` Peter Maydell 2 siblings, 0 replies; 49+ messages in thread From: Peter Maydell @ 2013-05-30 10:18 UTC (permalink / raw) To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, David Gibson On 30 May 2013 08:44, Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > I found the migration broken on pseries platform, specifically, this patch > broke it: > > f1c72795af573b24a7da5eb52375c9aba8a37972 > migration: do not sent zero pages in bulk stage > > The idea is not to send zero pages to the destination guest which is > expected to have 100% empty RAM. This just seems like a broken assumption to me. If the migration protocol is going to skip sending zero pages it should do it by saying "there's a zero page here" and have the destination memset the page to zero. thanks -- PMM ^ permalink raw reply [flat|nested] 49+ messages in thread
end of thread, other threads:[~2013-06-13 2:41 UTC | newest] Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-05-30 7:44 [Qemu-devel] broken incoming migration Alexey Kardashevskiy 2013-05-30 7:49 ` Alexey Kardashevskiy 2013-05-30 7:49 ` Paolo Bonzini 2013-05-30 8:18 ` Alexey Kardashevskiy 2013-05-30 9:08 ` Peter Lieven 2013-05-30 9:31 ` Alexey Kardashevskiy 2013-05-30 13:00 ` Paolo Bonzini 2013-05-30 13:38 ` Alexey Kardashevskiy 2013-05-30 14:08 ` Paolo Bonzini 2013-05-30 14:38 ` Peter Lieven 2013-05-30 14:41 ` Paolo Bonzini 2013-06-04 13:52 ` Peter Lieven 2013-06-04 14:14 ` Paolo Bonzini 2013-06-04 14:38 ` Peter Lieven 2013-06-04 14:40 ` Paolo Bonzini 2013-06-04 14:48 ` Peter Lieven 2013-06-04 15:17 ` Paolo Bonzini 2013-06-04 19:15 ` Peter Lieven 2013-06-05 3:37 ` Alexey Kardashevskiy 2013-06-05 6:09 ` Peter Lieven 2013-06-09 4:12 ` liu ping fan 2013-06-09 7:22 ` Peter Lieven 2013-06-04 15:10 ` Peter Lieven 2013-06-08 8:27 ` Wenchao Xia 2013-06-08 8:30 ` Alexey Kardashevskiy 2013-06-09 2:16 ` Wenchao Xia 2013-06-09 2:34 ` Alexey Kardashevskiy 2013-06-09 2:52 ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt 2013-06-09 3:01 ` Alexey Kardashevskiy 2013-06-09 3:01 ` [Qemu-devel] " Wenchao Xia 2013-06-09 3:09 ` Alexey Kardashevskiy 2013-06-09 3:31 ` Wenchao Xia 2013-06-09 7:27 ` Peter Lieven 2013-06-10 6:39 ` Alexey Kardashevskiy 2013-06-10 6:50 ` Peter Lieven 2013-06-10 6:55 ` Alexey Kardashevskiy 2013-06-10 8:44 ` Peter Lieven 2013-06-10 9:10 ` Alexey Kardashevskiy 2013-06-10 9:33 ` [Qemu-devel] [Qemu-ppc] " Benjamin Herrenschmidt 2013-06-10 9:42 ` Peter Lieven 2013-06-09 2:53 ` Benjamin Herrenschmidt 2013-06-12 14:00 ` Paolo Bonzini 2013-06-12 14:11 ` Benjamin Herrenschmidt 2013-06-12 20:10 ` Paolo Bonzini 2013-06-13 2:41 ` Wenchao Xia 2013-06-03 10:04 ` [Qemu-devel] " Alexey Kardashevskiy 2013-06-04 10:56 ` Peter Lieven 2013-06-08 8:24 ` Wenchao Xia 2013-05-30 10:18 ` Peter Maydell
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.