Date: Tue, 10 Apr 2018 11:02:36 +0100
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Message-ID: <20180410100235.GC2559@work-vm>
References: <20180404080600.GA10540@xz-mi>
 <0a48a834f08d064eaa3eb4ef1b41235f@linux.vnet.ibm.com>
 <20180409185747.GL2449@work-vm>
 <20180410112255.7485f2a7@umbus.fritz.box>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20180410112255.7485f2a7@umbus.fritz.box>
Subject: Re: [Qemu-devel] [PATCH] migration: calculate expected_downtime with ram_bytes_remaining()
To: David Gibson <dgibson@redhat.com>
Cc: Balamuruhan S <bala24@linux.vnet.ibm.com>, Peter Xu,
 qemu-devel@nongnu.org, quintela@redhat.com

* David Gibson (dgibson@redhat.com) wrote:
> On Mon, 9 Apr 2018 19:57:47 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>
> > * Balamuruhan S (bala24@linux.vnet.ibm.com) wrote:
> > > On 2018-04-04 13:36, Peter Xu wrote:
> > > > On Wed, Apr 04, 2018 at 11:55:14AM +0530, Balamuruhan S wrote:
> [snip]
> > > > > > - postcopy: that'll let you start the destination VM even
> > > > > >   without transferring all the RAM beforehand
> > > > >
> > > > > I am seeing an issue in postcopy migration between POWER8 (16M)
> > > > > -> POWER9 (1G), where the hugepage size is different.  I am
> > > > > trying to enable it, but the host start address has to be
> > > > > aligned with the 1G page size in ram_block_discard_range(),
> > > > > which I am debugging further to fix.
> > > >
> > > > I thought the huge page size needs to be matched on both sides
> > > > currently for postcopy, but I'm not sure.
> > >
> > > You are right!  It should be matched, but we need to support
> > > POWER8 (16M) -> POWER9 (1G).
> >
> > CC Dave (though I think Dave's still on PTO).
> >
> > There are two problems there:
> >   a) Postcopy with really big huge pages is a problem, because it
> >      takes a long time to send the whole 1G page over the network
> >      and the vCPU is paused during that time; for example on a
> >      10Gbps link, it takes about 1 second to send a 1G page, so
> >      that's a silly time to keep the vCPU paused.
> >
> >   b) Mismatched page sizes are a problem on postcopy; we require
> >      that the whole of a host page is sent continuously, so that it
> >      can be atomically placed in memory; the source knows to do this
> >      based on the page sizes that it sees.  There are some other
> >      cases as well (e.g. discards have to be page aligned).
>
> I'm not entirely clear on what mismatched means here.  Mismatched
> between where and where?  I *think* the relevant thing is a mismatch
> between the host backing page size on source and destination, but I'm
> not certain.

Right.  As I understand it, we make no requirements on (an x86) guest
as to what page sizes it uses, given any particular host page sizes.

> > Both of the problems are theoretically fixable; but neither case is
> > easy.
> > (b) could be fixed by sending the hugepage size back to the source,
> >     so that it knows to perform alignments on a larger boundary to
> >     its own RAM blocks.
>
> Sounds feasible, but like something that will take some thought and
> time upstream.

Yes; it's not too bad.

> > (a) is a much, much harder problem; one *idea* would be a major
> >     reorganisation of the kernel's hugepage + userfault code to
> >     somehow allow them to temporarily present as normal pages rather
> >     than a hugepage.
>
> Yeah... for Power specifically, I think doing that would be really
> hard, verging on impossible, because of the way the MMU is
> virtualized.  Well... it's probably not too bad for a native POWER9
> guest (using the radix MMU), but the issue here is for POWER8 compat
> guests, which use the hash MMU.

My idea was to fill the pagetables for that hugepage using small-page
entries, but backed by the physical hugepage's memory, so that once
we're done we'd flip it back to being a single hugepage entry.  (But my
understanding is that doesn't fit at all into the way the kernel
hugepage code works.)

> > Does P9 really not have a hugepage that's smaller than 1G?
>
> It does (2M), but we can't use it in this situation.  As hinted above,
> POWER9 has two very different MMU modes, hash and radix.  In hash mode
> (which is similar to POWER8 and earlier CPUs) the hugepage sizes are
> 16M and 16G; in radix mode (more like x86) they are 2M and 1G.
>
> POWER9 hosts always run in radix mode.  Or at least, we only support
> running them in radix mode.  We support both radix mode and hash mode
> guests, the latter including all POWER8 compat mode guests.
>
> The next complication is that, because of the way the hash
> virtualization works, any page used by the guest must be
> HPA-contiguous, not just GPA-contiguous.  Which means that any page
> size used by the guest must be smaller than or equal to the host page
> sizes used to back the guest.  We (sort of) cope with that by only
> advertising the 16M page size to the guest if all guest RAM is backed
> by >= 16M pages.
>
> But that advertisement only happens at guest boot.  So if we migrate a
> guest from POWER8, backed by 16M pages, to POWER9, backed by 2M pages,
> the guest still thinks it can use 16M pages and jams up.  (I'm in the
> middle of upstream work to make the failure mode less horrible.)
>
> So, the only way to run a POWER8 compat mode guest with access to 16M
> pages on a POWER9 radix mode host is using 1G hugepages on the host
> side.

Ah, OK; I'm not seeing an easy answer here.
The only vague thing I can think of is if you gave P9 a fake 16M
hugepage mode that did all the HPA and mappings in 16M chunks (using
8 x 2M page entries).

Dave

> --
> David Gibson
> Principal Software Engineer, Virtualization, Red Hat

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
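
[Editor's note: to make the numbers in the thread concrete, here is a
minimal, self-contained sketch of the idea behind the patch in the
subject line -- estimating expected downtime as the bytes still to be
sent divided by the measured bandwidth -- together with the "1G page
over a 10Gbps link" figure quoted above.  This is not QEMU's actual
migration code; all names and sample values are hypothetical.]

/*
 * Illustrative sketch only -- not the QEMU patch under discussion.
 * Derives an expected downtime from the bytes still to transfer
 * (cf. ram_bytes_remaining()) and the measured bandwidth, and works
 * through the "1G page over a 10Gbps link" figure from the thread.
 * All names and sample values here are hypothetical.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Estimated downtime (ms) if the guest were stopped right now. */
static uint64_t expected_downtime_ms(uint64_t bytes_remaining,
                                     double bandwidth_bytes_per_ms)
{
    if (bandwidth_bytes_per_ms <= 0.0) {
        return UINT64_MAX; /* no bandwidth measurement yet */
    }
    return (uint64_t)(bytes_remaining / bandwidth_bytes_per_ms);
}

int main(void)
{
    /* 10 Gbps ~= 1.25e9 bytes/s = 1.25e6 bytes/ms */
    double bw_bytes_per_ms = 1.25e6;

    /* Hypothetical 4 GiB of guest RAM still to transfer. */
    uint64_t remaining = 4ULL << 30;
    printf("expected downtime for 4 GiB remaining: %" PRIu64 " ms\n",
           expected_downtime_ms(remaining, bw_bytes_per_ms));

    /*
     * A single 1 GiB hugepage is ~8.6 Gbit, so on a 10 Gbps link it
     * alone takes ~0.86 s to send -- the "about 1 second" during which
     * a faulting vCPU stays paused in postcopy.
     */
    printf("time to send one 1 GiB page: %" PRIu64 " ms\n",
           expected_downtime_ms(1ULL << 30, bw_bytes_per_ms));
    return 0;
}

The real migration code is of course more involved (dirty-page
tracking, rate limiting, downtime limits); this only shows the
back-of-the-envelope arithmetic the thread relies on.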