From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: David Gibson <dgibson@redhat.com>
Cc: Balamuruhan S <bala24@linux.vnet.ibm.com>,
	Peter Xu <peterx@redhat.com>,
	qemu-devel@nongnu.org, quintela@redhat.com
Subject: Re: [Qemu-devel] [PATCH] migration: calculate expected_downtime with ram_bytes_remaining()
Date: Tue, 10 Apr 2018 11:02:36 +0100	[thread overview]
Message-ID: <20180410100235.GC2559@work-vm> (raw)
In-Reply-To: <20180410112255.7485f2a7@umbus.fritz.box>

* David Gibson (dgibson@redhat.com) wrote:
> On Mon, 9 Apr 2018 19:57:47 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Balamuruhan S (bala24@linux.vnet.ibm.com) wrote:
> > > On 2018-04-04 13:36, Peter Xu wrote:  
> > > > On Wed, Apr 04, 2018 at 11:55:14AM +0530, Balamuruhan S wrote:
> [snip]
> > > > > > - postcopy: that'll let you start the destination VM even without
> > > > > >   transferring all the RAMs before hand  
> > > > > 
> > > > > I am seeing an issue in postcopy migration between POWER8(16M) ->
> > > > > POWER9(1G), where the hugepage size is different. I am trying to
> > > > > enable it, but the host start address has to be aligned with the 1G
> > > > > page size in ram_block_discard_range(), which I am debugging
> > > > > further to fix.
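The alignment failure described above comes down to rounding a discard range inward to the host page size: a partial huge page cannot be discarded. A minimal sketch of that constraint (illustrative only, not QEMU's actual ram_block_discard_range() logic; the helper name is invented):

```python
GiB = 1 << 30
MiB = 1 << 20

def align_discard_range(start, length, host_page_size):
    """Shrink [start, start+length) inward to host_page_size boundaries,
    since only whole host pages can be discarded."""
    aligned_start = -(-start // host_page_size) * host_page_size  # round up
    aligned_end = (start + length) // host_page_size * host_page_size  # round down
    if aligned_end <= aligned_start:
        return None  # range covers no complete host page: nothing to discard
    return (aligned_start, aligned_end - aligned_start)

# A 16M range from a 16M-page source shrinks to nothing against a 1G
# destination page -- the failure mode seen in the POWER8 -> POWER9 case.
print(align_discard_range(16 * MiB, 16 * MiB, GiB))  # None
print(align_discard_range(0, 2 * GiB, GiB))          # (0, 2147483648)
```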
> > > > 
> > > > I thought the huge page size needs to be matched on both sides
> > > > currently for postcopy, but I'm not sure.
> > > 
> > > You are right, it should be matched, but we need to support
> > > POWER8(16M) -> POWER9(1G)
> > >   
> > > > CC Dave (though I think Dave's still on PTO).  
> > 
> > There are two problems there:
> >   a) Postcopy with really big huge pages is a problem, because it takes
> >      a long time to send the whole 1G page over the network and the vCPU
> >      is paused during that time;  for example on a 10Gbps link, it takes
> >      about 1 second to send a 1G page, so that's a silly time to keep
> >      the vCPU paused.
> > 
> >   b) Mismatched pagesizes are a problem on postcopy; we require that the
> >      whole of a hostpage is sent continuously, so that it can be
> >      atomically placed in memory; the source knows to do this based on
> >      the page sizes that it sees.  There are some other cases as well
> >      (e.g. discards have to be page aligned.)
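The "about 1 second" figure in (a) can be checked with quick back-of-envelope arithmetic:

```python
GiB = 1 << 30
MiB = 1 << 20

def page_transfer_secs(page_bytes, link_gbps):
    """Seconds to push one page of page_bytes over a link_gbps network link."""
    link_bytes_per_sec = link_gbps * 1e9 / 8  # Gbps -> bytes/s
    return page_bytes / link_bytes_per_sec

# 1G page on a 10Gbps link: roughly 0.86s of vCPU pause.
print(f"1G page @ 10Gbps: {page_transfer_secs(GiB, 10):.2f}s")
# 16M page on the same link: about 13ms, far less disruptive.
print(f"16M page @ 10Gbps: {page_transfer_secs(16 * MiB, 10) * 1000:.1f}ms")
```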
> 
> I'm not entirely clear on what mismatched means here.  Mismatched
> between where and where?  I *think* the relevant thing is a mismatch
> between host backing page size on source and destination, but I'm not
> certain.

Right.  As I understand it, we make no requirements on (an x86) guest
as to what page sizes it uses given any particular host page sizes.

> > Both of the problems are theoretically fixable; but neither case is
> > easy.
> > (b) could be fixed by sending the hugepage size back to the source,
> > so that it knows to perform alignments on a larger boundary than its
> > own RAM blocks would otherwise require.
> 
> Sounds feasible, but like something that will take some thought and
> time upstream.

Yes; it's not too bad.

> > (a) is a much, much harder problem; one *idea* would be a major
> > reorganisation of the kernel's hugepage + userfault code to somehow
> > allow them to temporarily present as normal pages rather than a
> > hugepage.
> 
> Yeah... for Power specifically, I think doing that would be really
> hard, verging on impossible, because of the way the MMU is
> virtualized.  Well.. it's probably not too bad for a native POWER9
> guest (using the radix MMU), but the issue here is for POWER8 compat
> guests which use the hash MMU.

My idea was to fill the pagetables for that hugepage using small-page
entries that still point at the physical hugepage's memory, so that once
we're done we'd flip it back to being a single hugepage entry.
(But my understanding is that this doesn't fit at all into the way the
kernel hugepage code works).

> > Does P9 really not have a hugepage that's smaller than 1G?
> 
> It does (2M), but we can't use it in this situation.  As hinted above,
> POWER9 has two very different MMU modes, hash and radix.  In hash mode
> (which is similar to POWER8 and earlier CPUs) the hugepage sizes are
> 16M and 16G, in radix mode (more like x86) they are 2M and 1G.
> 
> POWER9 hosts always run in radix mode.  Or at least, we only support
> running them in radix mode.  We support both radix mode and hash mode
> guests, the latter including all POWER8 compat mode guests.
> 
> The next complication is that, because of the way hash virtualization
> works, any page used by the guest must be HPA-contiguous, not just
> GPA-contiguous.  Which means that any pagesize used by the guest must
> be smaller than or equal to the host pagesizes used to back the guest.
> We (sort of) cope with that by only advertising the 16M pagesize to the
> guest if all guest RAM is backed by >= 16M pages.
> 
> But that advertisement only happens at guest boot.  So if we migrate a
> guest from POWER8, backed by 16M pages to POWER9 backed by 2M pages,
> the guest still thinks it can use 16M pages and jams up.  (I'm in the
> middle of upstream work to make the failure mode less horrible).
> 
> So, the only way to run a POWER8 compat mode guest with access to 16M
> pages on a POWER9 radix mode host is using 1G hugepages on the host
> side.
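The constraint described above can be sketched as a simple rule: a guest page size is only usable if every host backing page size is at least as large, and the set of usable sizes is fixed at boot. This is an illustrative model, not QEMU code; the helper name is invented:

```python
MiB = 1 << 20
GiB = 1 << 30

def usable_guest_page_sizes(guest_sizes, host_backing_sizes):
    """Guest page sizes that fit inside every host backing page
    (hash-MMU guests need guest pages to be HPA-contiguous)."""
    min_host = min(host_backing_sizes)
    return [s for s in guest_sizes if s <= min_host]

# POWER8-compat (hash) guest page sizes of interest: 64K and 16M.
hash_guest_sizes = [64 * 1024, 16 * MiB]

# Host backed by 1G pages: 16M remains usable after migration.
print(usable_guest_page_sizes(hash_guest_sizes, [GiB]))      # [65536, 16777216]
# Host backed by 2M pages: only 64K survives; a guest told "16M is
# fine" at boot would jam up after migrating here.
print(usable_guest_page_sizes(hash_guest_sizes, [2 * MiB]))  # [65536]
```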

Ah ok;  I'm not seeing an easy answer here.
The only vague thing I can think of is if you gave P9 a fake 16M
hugepage mode that did all HPA allocation and mapping in 16M chunks
(using 8 x 2M page entries).

Dave

> -- 
> David Gibson <dgibson@redhat.com>
> Principal Software Engineer, Virtualization, Red Hat


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
