Message-ID: <51B3EFFA.4040608@linux.vnet.ibm.com>
Date: Sun, 09 Jun 2013 11:01:14 +0800
From: Wenchao Xia
In-Reply-To: <51B3E9A8.5010705@ozlabs.ru>
Subject: Re: [Qemu-devel] broken incoming migration
To: Alexey Kardashevskiy
Cc: "qemu-ppc@nongnu.org", Paolo Bonzini, Peter Lieven, "qemu-devel@nongnu.org", David Gibson

On 2013-6-9 10:34, Alexey Kardashevskiy wrote:
> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>> On 2013-6-8 16:30, Alexey Kardashevskiy wrote:
>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>> On 04/06/2013 16:38, Peter Lieven wrote:
>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>> On 04/06/2013 15:52, Peter Lieven wrote:
>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>> On 30/05/2013 16:38, Peter Lieven wrote:
>>>>>>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>>>>>>> i had this in mind, but then chose the other approach.... turned
>>>>>>>>>>> out to be a bad idea.
>>>>>>>>>>>
>>>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>>>> verify it fixes your problem.
>>>>>>>>>>>
>>>>>>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>>>>>>> the zeroes?
>>>>>>>>>> It should be enough to not write them.
>>>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>>>> source.
>>>>>>>> It doesn't look like it. I tried this program and top doesn't show an
>>>>>>>> increasing amount of reserved memory:
>>>>>>>>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <stdlib.h>
>>>>>>>> int main()
>>>>>>>> {
>>>>>>>>     char *x = malloc(500 << 20);
>>>>>>>>     int i, j;
>>>>>>>>     for (i = 0; i < 500; i += 10) {
>>>>>>>>         for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>>>             *(volatile char*) (x + (i << 20) + j);
>>>>>>>>         }
>>>>>>>>         getchar();
>>>>>>>>     }
>>>>>>>> }
>>>>>>> strange. we are talking about RSS size, right?
>>>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>>>
>>>>>>> is the malloc above using mmapped memory?
>>>>>> Yes.
>>>>>>
>>>>>>> which kernel version do you use?
>>>>>> 3.9.
>>>>>>
>>>>>>> what avoids allocating the memory for me is the following (with
>>>>>>> whatever side effects it has ;-))
>>>>>> This would also fail to migrate any page that is swapped out, breaking
>>>>>> overcommit in a more subtle way. :)
>>>>>>
>>>>>> Paolo
>>>>> the following does also not allocate memory, but qemu does...
>>>>>
>>>> Hi, Peter
>>>>    As the patch writes
>>>>
>>>> "not sending zero pages breaks migration if a page is zero
>>>> at the source but not at the destination."
>>>>
>>>>    I don't understand why it would be trouble, shouldn't all pages
>>>> not received in dest be treated as zero pages?
>>>
>>>
>>> How would the destination guest know if some page must be cleared? The
>>> previous patch (which Peter reverted) did not send anything for the pages
>>> which were zero on the source side.
>>>
>>>
>>    If a page was not received and destination knows that page should
>> exist according to total size, fill it with zero at destination, would
>> it solve the problem?
>
> It is _live_ migration, the source sends changes, same pages can change and
> be sent several times. So we would need to turn tracking on on the
> destination to know if some page was received from the source or changed by
> the destination itself (by writing there bios/firmware images, etc) and
> then clear pages which were touched by the destination and were not sent by
> the source.

  OK, I understand the problem now. For example:
  Destination boots up with 0x0000-0xFFFF filled with the bios image.
  Source does not send the zero pages in 0x0000-0xFFFF.
  After migration, 0x0000-0xFFFF on the destination is dirty (different
from the source).
  Thanks for the explanation. This seems to come down to the migration
protocol: how should the destination treat unsent pages? The patch that
caused the problem treats zero pages as "not to be sent" at the source,
but the other half is missing: treating "not received" pages as zero
pages at the destination. I guess that if the second half is added, the
problem goes away: after page transfer has completed and before the
destination resumes, fill the "not received" pages with zeros.

>
> Or we do not make guesses, the source sends everything and the destination
> simply checks if a page which is empty on the source is empty on the
> destination and avoids writing zeroes to it. Looks simpler to me and this is
> what the new patch does.
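
  The two destination-side options discussed above can be sketched
roughly as follows. This is only an illustration under stated
assumptions, not QEMU code: struct dest_ram, recv_page(),
zero_unreceived_pages(), recv_zero_page() and the SK_* constants are all
hypothetical names made up for the sketch, and the check with
page_is_zero() before clearing in the first option is borrowed from the
second one so that untouched pages are not written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SIZE 4096  /* hypothetical page size for the sketch */
#define SK_NPAGES    4     /* hypothetical, tiny guest RAM */

/* Hypothetical destination state; none of these names come from QEMU. */
struct dest_ram {
    uint8_t mem[SK_NPAGES][SK_PAGE_SIZE];
    bool    received[SK_NPAGES];  /* option 1 only: one flag per page */
};

/* Return true if every byte of the page is zero. */
static bool page_is_zero(const uint8_t *page)
{
    for (size_t i = 0; i < SK_PAGE_SIZE; i++) {
        if (page[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Option 1, receive path: copy a page and remember that it arrived. */
static void recv_page(struct dest_ram *d, size_t idx, const uint8_t *data)
{
    assert(idx < SK_NPAGES);
    memcpy(d->mem[idx], data, SK_PAGE_SIZE);
    d->received[idx] = true;
}

/*
 * Option 1, final step: after the transfer has completed and before the
 * destination resumes, clear every page that was never received.
 */
static void zero_unreceived_pages(struct dest_ram *d)
{
    for (size_t i = 0; i < SK_NPAGES; i++) {
        if (!d->received[i] && !page_is_zero(d->mem[i])) {
            memset(d->mem[i], 0, SK_PAGE_SIZE);
        }
    }
}

/*
 * Option 2: the source announces every zero page; the destination only
 * writes zeroes if its copy of the page is not already zero.
 */
static void recv_zero_page(struct dest_ram *d, size_t idx)
{
    assert(idx < SK_NPAGES);
    if (!page_is_zero(d->mem[idx])) {
        memset(d->mem[idx], 0, SK_PAGE_SIZE);
    }
}
```

  With option 1, a page holding a bios image that the source never
sends is cleared by zero_unreceived_pages(); with option 2 the same page
is cleared when recv_zero_page() is called for it. Note that the sketch
of option 1 only covers pages that were never sent: a page that was sent
once and later became zero on the source would keep its received flag
and its stale contents, which is the live migration case raised in the
quoted reply.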
>
>
>>
>>>
>>>>    Also, do you mean the following code is from qemu and it does not
>>>> allocate memory with your gcc? Maybe it is related to KVM, how about
>>>> turning off KVM and retrying the following code in qemu?
>>>>
>>>>> #include <assert.h>
>>>>> #include <errno.h>
>>>>> #include <stdint.h>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <string.h>
>>>>> #include <unistd.h>
>>>>> #include <sys/mman.h>
>>>>> #include <sys/resource.h>
>>>>>
>>>>> #if defined __SSE2__
>>>>> #include <emmintrin.h>
>>>>> #define VECTYPE        __m128i
>>>>> #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>>>>> #else
>>>>> #define VECTYPE        unsigned long
>>>>> #define SPLAT(p)       (*(p) * (~0UL / 255))
>>>>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>>>>> #endif
>>>>>
>>>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>>>>
>>>>> /* Round number down to multiple */
>>>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>>>>
>>>>> /* Round number up to multiple */
>>>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>>>>>
>>>>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>>>>>
>>>>> /* alloc shared memory pages */
>>>>> void *qemu_anon_ram_alloc(size_t size)
>>>>> {
>>>>>     size_t align = QEMU_VMALLOC_ALIGN;
>>>>>     size_t total = size + align - getpagesize();
>>>>>     void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>>>>>                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>>>     size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) -
>>>>>                     (uintptr_t)ptr;
>>>>>
>>>>>     if (ptr == MAP_FAILED) {
>>>>>         fprintf(stderr, "Failed to allocate %zu B: %s\n",
>>>>>                 size, strerror(errno));
>>>>>         abort();
>>>>>     }
>>>>>
>>>>>     ptr += offset;
>>>>>     total -= offset;
>>>>>
>>>>>     if (offset > 0) {
>>>>>         munmap(ptr - offset, offset);
>>>>>     }
>>>>>     if (total > size) {
>>>>>         munmap(ptr + size, total - size);
>>>>>     }
>>>>>
>>>>>     return ptr;
>>>>> }
>>>>>
>>>>> static inline int
>>>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>>>>> {
>>>>>     return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>>>>                    * sizeof(VECTYPE)) == 0
>>>>>             && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>>>>> }
>>>>>
>>>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>>>> {
>>>>>     const VECTYPE *p = buf;
>>>>>     const VECTYPE zero = (VECTYPE){0};
>>>>>     size_t i;
>>>>>
>>>>>     if (!len) {
>>>>>         return 0;
>>>>>     }
>>>>>
>>>>>     assert(can_use_buffer_find_nonzero_offset(buf, len));
>>>>>
>>>>>     for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>>>>         if (!ALL_EQ(p[i], zero)) {
>>>>>             return i * sizeof(VECTYPE);
>>>>>         }
>>>>>     }
>>>>>
>>>>>     for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>>>>          i < len / sizeof(VECTYPE);
>>>>>          i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>>>>         VECTYPE tmp0 = p[i + 0] | p[i + 1];
>>>>>         VECTYPE tmp1 = p[i + 2] | p[i + 3];
>>>>>         VECTYPE tmp2 = p[i + 4] | p[i + 5];
>>>>>         VECTYPE tmp3 = p[i + 6] | p[i + 7];
>>>>>         VECTYPE tmp01 = tmp0 | tmp1;
>>>>>         VECTYPE tmp23 = tmp2 | tmp3;
>>>>>         if (!ALL_EQ(tmp01 | tmp23, zero)) {
>>>>>             break;
>>>>>         }
>>>>>     }
>>>>>
>>>>>     return i * sizeof(VECTYPE);
>>>>> }
>>>>>
>>>>> int main()
>>>>> {
>>>>>     //char *x = malloc(1024 << 20);
>>>>>     char *x = qemu_anon_ram_alloc(1024 << 20);
>>>>>
>>>>>     int i, j;
>>>>>     int ret = 0;
>>>>>     struct rusage rusage;
>>>>>     for (i = 0; i < 500; i ++) {
>>>>>         for (j = 0; j < 10 << 20; j += 4096) {
>>>>>             ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
>>>>>                                               + j), 4096);
>>>>>         }
>>>>>         getrusage( RUSAGE_SELF, &rusage );
>>>>>         printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
>>>>>                rusage.ru_maxrss);
>>>>>         getchar();
>>>>>     }
>>>>>     printf("%d zero pages\n", ret);
>>>>> }
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

-- 
Best Regards

Wenchao Xia