From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pagupta@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	qemu-devel@nongnu.org, Paolo Bonzini <pbonzini@redhat.com>,
	eblake@redhat.com, stefanha@redhat.com
Subject: Re: [Qemu-devel] [PATCH v3 3/3] virtio-pmem: should we make it migratable???
Date: Tue, 8 May 2018 10:44:40 +0100	[thread overview]
Message-ID: <20180508094439.GE2500@work-vm> (raw)
In-Reply-To: <20180507101228.0d7b7b66@redhat.com>

* Igor Mammedov (imammedo@redhat.com) wrote:
> On Fri, 4 May 2018 13:26:51 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Igor Mammedov (imammedo@redhat.com) wrote:
> > > On Thu, 26 Apr 2018 03:37:51 -0400 (EDT)
> > > Pankaj Gupta <pagupta@redhat.com> wrote:
> > > 
> > > trimming CC list to keep people that might be interested in the topic
> > > and renaming thread to reflect it.
> > >   
> > > > > > > > > > >> +
> > > > > > > > > > >> +    memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
> > > > > > > > > > > missing vmstate registration?    
> > > > > > > > > > 
> > > > > > > > > > Missed this one: To be called by the caller. Important because e.g.
> > > > > > > > > > for virtio-pmem we don't want this (I assume :) ).
> > > > > > > > > if pmem isn't on shared storage, then we'd probably want to migrate
> > > > > > > > > it as well, otherwise the target would experience data loss.
> > > > > > > > > Anyway, I'd just treat it as normal RAM in the migration case
> > > > > > > > 
> > > > > > > > The main difference between RAM and pmem is that it acts like a
> > > > > > > > combination of RAM and disk. That said, in the normal use-case the
> > > > > > > > size would be in the 100s of GB to few TB range. I am not sure we
> > > > > > > > really want to migrate it for the non-shared storage use-case.
> > > > > > > with non-shared storage you'd have to migrate it to the target host, but
> > > > > > > with shared storage it might be possible to flush it and use it directly
> > > > > > > from the target host. That probably won't work right out of the box and
> > > > > > > would need some sort of synchronization between src/dst hosts.
> > > > > > 
> > > > > > Shared storage should work out of the box. The only thing is that data
> > > > > > on the destination host will be cache cold, and existing pages in the
> > > > > > cache should be invalidated first. But if we migrate the entire fake DAX
> > > > > > RAM state it will populate the destination host's page cache, including
> > > > > > pages which were idle on the source host. This would unnecessarily
> > > > > > create entropy on the destination host.
> > > > > > 
> > > > > > To me this feature doesn't make much sense. The problem we are solving
> > > > > > is: efficiently use guest RAM.
> > > > > What would the live migration handover flow look like in the case of a
> > > > > guest constantly dirtying memory provided by virtio-pmem and sometimes
> > > > > issuing an async flush request along with it?
> > > > 
> > > > Dirtying the entire pmem (disk) at once is not a usual scenario. Some part
> > > > of the disk/pmem would get dirty and we need to handle that. I just want to
> > > > say that moving the entire pmem (disk) is not an efficient solution, because
> > > > we are using this solution to manage guest memory efficiently. Otherwise it
> > > > will be like any block device copy with non-shared storage.
> > > not sure if we can use block layer analogy here.
> > >   
> > > > > > > The same applies to nv/pc-dimm as well, as the backing file could
> > > > > > > easily be on pmem storage too.
> > > > > > 
> > > > > > Are you saying the backing file is on actual NVDIMM hardware? Then we
> > > > > > don't need emulation at all.
> > > > > That depends on whether the file is on a DAX filesystem, but your argument
> > > > > about migrating a huge 100 GB - few TB range applies in this case as well.
> > > > >     
> > > > > >     
> > > > > > > 
> > > > > > > Maybe for now we should migrate everything so it would work in the case
> > > > > > > of a non-shared NVDIMM on the host, and then later add a migration-less
> > > > > > > capability to all of them.
> > > > > > 
> > > > > > not sure I agree.    
> > > > > So would you inhibit migration in the case of non-shared backend storage,
> > > > > to avoid losing data since it isn't migrated?
> > > > 
> > > > I am just thinking about what features we want to support with pmem, and
> > > > live migration with shared storage is the one which comes to mind.
> > > > 
> > > > If live migration with non-shared storage is what we want to support (I
> > > > don't know yet), we can add this? Even with shared storage it would copy
> > > > the entire pmem state?
> > > Perhaps we should register vmstate like for normal RAM and use something
> > > similar to http://lists.gnu.org/archive/html/qemu-devel/2018-04/msg00003.html
> > > to skip shared memory on migration.
> > > In that case we could use this for pc-dimms as well.
> > > 
> > > David,
> > >  what's your take on it?  
> > 
> > My feeling is that something is going to have to migrate it; I'm just
> > not sure how.
> > So let me just check I understand:
> >   a) It's potentially huge
> yep, assume it could be in storage-scale quantities (100s of GB)
> 
> >   b) It's a RAMBlock
> it is

Well, the good news is migration is going to try and migrate it.
The bad news is migration is going to try and migrate it.

> >   c) It's backed by ????
> >      c1) Something machine local - i.e. a physical lump of flash in a
> >          socket rather than something sharable by machines?
> it's backed by memory-backend-foo, so it could be really anything (RAM,
> file on local or shared storage, file descriptor)

OK, something is going to have to know whether it's on shared storage or
not and do something different in the two cases.  If it's shared
storage then we need to find a way to stop migration from trying to
migrate it, because migrating data to the other host when both hosts are
really backed by the same thing ends up with a corrupt mess; we've had
block storage do that when it didn't realise it was on an NFS share.
There are a few patches on the list to exclude some RAMBlocks from
migration, so we can build on that once we figure out how we know
whether it's shared or not.
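
(Very roughly, the sort of filter such a skip-shared capability might add
in migration/ram.c could look like the sketch below; migrate_skip_shared()
is a made-up capability check here, while qemu_ram_is_shared() is just the
existing RAM_SHARED test - so treat it as an illustration, not the actual
patches on the list.)

static bool ramblock_should_migrate(RAMBlock *rb)
{
    /* Skip blocks whose backing the destination can already see.
     * migrate_skip_shared() is hypothetical; qemu_ram_is_shared()
     * reports the RAM_SHARED flag (share=on backends). */
    if (migrate_skip_shared() && qemu_ram_is_shared(rb)) {
        return false;
    }
    return true;
}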

If it is shared, then we've also got to worry about consistency to
ensure that the last few writes on the source make it to the destination
before the destination starts, that the destination hasn't cached
any old stuff, and that a failing migration lands back on the source
without the destination having changed anything.
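
(The source-side half of that is conceptually just msync plus fsync on
the backing file before handover - a minimal, self-contained sketch, with
addr/len/fd standing in for whatever the device actually holds:)

#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Push any dirty pages of the guest-visible mapping out to the backing
 * file and make the file durable, before the destination takes over. */
int flush_backing(void *addr, size_t len, int fd)
{
    if (msync(addr, len, MS_SYNC) < 0) {
        return -errno;
    }
    if (fsync(fd) < 0) {
        return -errno;
    }
    return 0;
}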

> >   d) It can potentially be rapidly changing as the guest writes to it?
> it's sort of like an NVDIMM but without the NVDIMM interface; it uses
> virtio to force flushing instead. Otherwise it's directly mapped into the
> guest address space, so the guest can do anything with it, including fast
> dirtying.

OK.

Getting Postcopy to work with it might be a solution; but depending on
what the underlying fd looks like, it will probably need some kernel
changes to get userfaultfd to work on it.
Postcopy on huge memories should work, but watch out for downtime
due to sending the discard bitmaps.
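
(For reference, the bit of plumbing postcopy wants from the kernel is
roughly the below: create a userfaultfd and register the mapping for
missing-page faults. This is a stand-alone sketch, not QEMU's actual
postcopy code; it works for anonymous/shmem/hugetlbfs backings, and an
arbitrary DAX/file-backed fd is exactly where kernel work may be needed.)

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int register_range_with_uffd(void *start, size_t len)
{
    /* Create the userfaultfd and negotiate the API version. */
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0) {
        return -1;
    }

    struct uffdio_api api = { .api = UFFD_API };
    if (ioctl(uffd, UFFDIO_API, &api) < 0) {
        close(uffd);
        return -1;
    }

    /* Ask for missing-page faults on the mapping; this is the call
     * that fails if the backing type isn't supported. */
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)start, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        close(uffd);
        return -1;
    }
    return uffd;
}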

(cc'd in Eric and Stefan)

Dave

> 
> > Dave
> > 
> > > > Thanks,
> > > > Pankaj
> > > >    
> > > > > 
> > > > >     
> > > > > > > > One reason why nvdimm added vmstate info could be: there would still
> > > > > > > > be transient writes in memory with fake DAX and there is no way (until
> > > > > > > > now) to flush the guest writes. But with virtio-pmem we can flush such
> > > > > > > > writes before migration, and at the destination host with a shared
> > > > > > > > disk we will automatically have the updated data.
> > > > > > > nvdimm has the concept of a flush hint address (maybe not implemented
> > > > > > > in QEMU yet), but it can flush. The only reason I'm buying into the
> > > > > > > virtio-pmem idea is that it would allow async flush queues, which would
> > > > > > > reduce the number of vmexits.
> > > > > > 
> > > > > > That's correct.
> > > > > > 
> > > > > > Thanks,
> > > > > > Pankaj
> > > > > > 
> > > > > >      
> > > > > 
> > > > > 
> > > > >     
> > > >   
> > >   
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
