From: Peter Xu <peterx@redhat.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Tian Kevin <kevin.tian@intel.com>,
	qemu-devel@nongnu.org, Juan Quintela <quintela@redhat.com>
Subject: Re: [PATCH RFC 0/4] vl: Sync dirty bitmap when system resets
Date: Sat, 2 May 2020 17:04:23 -0400
Message-ID: <20200502210423.GB6299@xz-x1>
In-Reply-To: <20200429143227.GC4280@xz-x1>

On Wed, Apr 29, 2020 at 10:32:27AM -0400, Peter Xu wrote:
> On Wed, Apr 29, 2020 at 02:26:07PM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > This RFC series starts from the fact that we sync the dirty bitmap when
> > > removing a memslot for KVM.  IIUC that is mainly to maintain the dirty
> > > bitmap even across a system reboot.
> > > 
> > > This series wants to move that sync from kvm memslot removal to system reset.
> > > 
> > > (I still don't know why a system being reset would need to keep the RAM
> > >  status from before the reset.  I thought things like kdump might use it to
> > >  retrieve info from a previous kernel panic, however IIUC that's not what
> > >  kdump does now.  Anyway, I'd be more than glad if anyone knows the real
> > >  scenario behind this...)
> > 
> > Aren't there pages written by the BIOS that are read by the system as it
> > comes up through reset - so you need those pages intact?
> > (But I don't think that slot gets removed? Or does it - the bios has
> > some weird aliasing)
> 
> Do you know which section it writes to?  I can trace a bit to see...
> 
> My previous understanding was that once RAM starts to be written, the memory
> layout shouldn't change anymore, but I could be wrong.
> 
> > 
> > > The current solution (sync at kvm memslot removal) works in most cases, but:
> > > 
> > >   - it will be nearly impossible to make it work for dirty ring, and,
> > 
> > Why doesn't that work with dirty ring?
> 
> Because a strict sync is hard imho... I shouldn't say it's impossible, but it
> would depend on the fixes for the existing kvm get-dirty-log path first, so
> it's even harder (please see below).
> 
> I'll start with the normal get-dirty-log case.  Currently, qemu's KVM code
> syncs dirty bits at slot removal with (roughly sketched below):
> 
>   - remove memslot
>     - KVM_GET_DIRTY_LOG on memslot
>     - KVM_SET_USER_MEMORY_REGION to remove
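> 
> To make those two steps concrete, here is a rough userspace-level sketch of
> what the removal path boils down to (error handling omitted; vm_fd, slot_id
> and the bitmap allocation are placeholders, assuming <linux/kvm.h>):
> 
>   struct kvm_dirty_log log = {
>       .slot = slot_id,
>       .dirty_bitmap = bitmap,    /* one bit per page, allocated by userspace */
>   };
>   ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);          /* fetch the final dirty log */
> 
>   struct kvm_userspace_memory_region mr = {
>       .slot = slot_id,
>       .memory_size = 0,                           /* size 0 deletes the slot */
>   };
>   ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mr);  /* remove the memslot */
> 
>   /* any write from a vcpu between the two ioctls is lost from the bitmap */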
> 
> But it's racy because writes can happen between the two steps.  One correct way
> to do that is:
> 
>   - remove memslot
>     - KVM_SET_USER_MEMORY_REGION to set it READONLY [a]
>       - kick all vcpu out synchronously to make sure dirty bits are flushed to
>         KVM per-memslot dirty bitmaps [b]
>     - KVM_GET_DIRTY_LOG on memslot
>     - KVM_SET_USER_MEMORY_REGION to remove
> 
> The fix needs both a QEMU change at [a] to first mark the slot READONLY so
> that no further writes can happen before the get-dirty-log, and a kernel
> change at [b] to synchronously flush each vcpu to make sure PML is drained
> (which could be very awkward; I proposed an idea [1] but I didn't get any
> positive feedback yet).
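> 
> For [a], the intent would be roughly the below before fetching the log (just
> a sketch; note that IIRC current KVM rejects a flags-only toggle of
> KVM_MEM_READONLY on an existing slot with -EINVAL, which is part of why [a]
> isn't trivial and may need its own kernel change or a delete+re-add dance):
> 
>   struct kvm_userspace_memory_region mr = {
>       .slot            = slot_id,
>       .flags           = KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY,
>       .guest_phys_addr = gpa,
>       .memory_size     = size,
>       .userspace_addr  = hva,
>   };
>   ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mr);  /* [a] stop new writes */
>   /* [b] (kernel side) kick vcpus synchronously so PML etc. is flushed into
>    * the per-memslot dirty bitmap */
>   /* ... then KVM_GET_DIRTY_LOG and the size-0 removal as before ... */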
> 
> [1] https://lore.kernel.org/qemu-devel/20200402223240.GG103677@xz-x1/
> 
> So I think it's not easy to make it fully right even for get dirty log.
> 
> For dirty ring, the sync will look like:
> 
>   - remove memslot
>     - KVM_SET_USER_MEMORY_REGION to set it READONLY [a]
>       - kick all vcpu out synchronously to make sure dirty bits are flushed to
>         KVM per-memslot dirty bitmaps [b]
>     - collect dirty ring [c]
> 
> So it depends on us fixing [a,b] first, and then, because the dirty ring is
> per-vcpu rather than per-slot, [c] needs to collect the data of all vcpus
> (hence across all slots).  So it will need to take every slots_lock even
> though we're removing one single small memslot, with one slots_lock already
> held.  I think we can still do it, but as you can see, it's awkward and
> harder than the legacy get-dirty-log process.
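> 
> To illustrate what [c] implies (using the per-vcpu dirty-gfn layout from the
> dirty ring series; names may still change, and next_dirty_entry() /
> mark_slot_page_dirty() below are just placeholders for however we walk the
> ring and feed the migration bitmap):
> 
>   /* each vcpu exposes a ring of these, mmap()ed from the vcpu fd */
>   struct kvm_dirty_gfn {
>       __u32 flags;      /* KVM_DIRTY_GFN_F_DIRTY when the entry is published */
>       __u32 slot;       /* (as_id | slot_id) -- can be *any* memslot */
>       __u64 offset;     /* page offset within that memslot */
>   };
> 
>   /* [c] has to drain every vcpu's ring; there's no way to filter one slot */
>   for (int i = 0; i < nr_vcpus; i++) {
>       struct kvm_dirty_gfn *gfn;
>       while ((gfn = next_dirty_entry(vcpu[i])) != NULL) {    /* placeholder */
>           mark_slot_page_dirty(gfn->slot, gfn->offset);      /* placeholder */
>           /* ... mark the entry as harvested per the ring ABI ... */
>       }
>   }
>   ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);   /* let the kernel recycle entries */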
> 
> I was always wondering why this sync is that useful and in which cases we
> actually rely on it.  If system reset is the only place, then it'll be very
> clean to move it there, because that's a very high point in the stack where
> we only hold the BQL, and at the same time all the vcpus are paused anyway,
> so most of the nasty things are gone.  We just do a simple sync like in this
> series and it'll work nicely for both dirty log and dirty ring (then we
> don't need to fix [a] and [b] either).
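> 
> I.e., at the reset point the sync could be as simple as something like this
> (roughly what this series does per the patch titles; exact function names
> and signatures may differ):
> 
>   /* in qemu_system_reset(), BQL held, all vcpus already stopped */
>   if (migrate_is_precopy()) {             /* introduced in patch 2 */
>       migration_bitmap_sync_precopy();    /* exported by patch 1 */
>   }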

Paolo,

Do you have any opinion on this series?  More generally, comments on how you
would prefer to fix the known dirty sync issue with memslot removal would be
welcome too.  For my part, I prefer the approach in this series so we won't
need to touch the kernel at all, and it also paves the way for dirty ring.
In any case, I do think collecting the dirty bits of a slot that is being
removed is slightly awkward already.

Dave has valid concerns about other cases besides reboot where we might still
want to keep the dirty bits, either (1) during BIOS/... boots, or (2) when the
guest unmaps device mmio regions that are backed by RAM.  I can continue to
look into these, however before that I'd appreciate knowing whether the
direction is correct and acceptable.

IMHO the worst case for a series like this would be a regression where dirty
bits are lost, but then we can simply revert the last patch to verify or fix
it (or we can fix the missing case instead).

Thanks,

> 
> > 
> > >   - it has an existing flaw on race condition. [1]
> > > 
> > > So if system reset is the only thing we care about here, I'm thinking whether
> > > we can move this sync explicitly to system reset so we do a global sync there
> > > instead of syncing every time the memory layout changes and causes memslot
> > > removals.  I think it is more explicit to sync during system reset, and also
> > > in that context it will be far easier for kvm dirty ring to provide the same
> > > logic.
> > > 
> > > This is totally RFC because I'm still trying to find out whether there are
> > > other cases besides system reset where we want to keep the dirty bits of a
> > > to-be-removed memslot (real memory removal like memory unplug shouldn't
> > > matter, because we won't care about the dirty bits of memory that's never
> > > coming back, not to mention we won't allow such things during a migration).
> > > So far I don't see any.
> > 
> > I'm still unsure when slot removal happens for real; but if it's
> > happening for RAM on PCI devices, then that would make sense as
> > something you might want to keep.
> 
> Yeah indeed.  Though I'm also curious whether there's a real user that would
> frequently map/unmap an MMIO region of device RAM.  There's also a workaround
> for this issue: mark the whole region as dirty when unmapping the MMIO
> region, but I haven't checked whether that's easy to do in qemu; I assume
> not that easy...
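> 
> For reference, the workaround I mean would be something along these lines
> from the corresponding region_del path (kvm_region_del_workaround() is a
> made-up name, and I haven't checked whether this is the right layer to hook):
> 
>   /* treat the whole region as dirty when it goes away, so migration will
>    * re-send it instead of relying on a final per-memslot dirty log fetch */
>   static void kvm_region_del_workaround(MemoryRegion *mr)
>   {
>       if (memory_region_is_ram(mr)) {
>           memory_region_set_dirty(mr, 0, memory_region_size(mr));
>       }
>   }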
> 
> Thanks,
> 
> -- 
> Peter Xu

-- 
Peter Xu




Thread overview: 14+ messages
2020-04-28 19:42 [PATCH RFC 0/4] vl: Sync dirty bitmap when system resets Peter Xu
2020-04-28 19:42 ` [PATCH RFC 1/4] migration: Export migration_bitmap_sync_precopy() Peter Xu
2020-05-06  9:58   ` Dr. David Alan Gilbert
2020-04-28 19:42 ` [PATCH RFC 2/4] migration: Introduce migrate_is_precopy() Peter Xu
2020-05-06 10:05   ` Dr. David Alan Gilbert
2020-05-06 14:52     ` Peter Xu
2020-04-28 19:42 ` [PATCH RFC 3/4] vl: Sync dirty bits for system resets during precopy Peter Xu
2020-05-06 10:53   ` Dr. David Alan Gilbert
2020-05-06 15:38     ` Peter Xu
2020-04-28 19:42 ` [PATCH RFC 4/4] kvm: No need to sync dirty bitmap before memslot removal any more Peter Xu
2020-04-29 13:26 ` [PATCH RFC 0/4] vl: Sync dirty bitmap when system resets Dr. David Alan Gilbert
2020-04-29 14:32   ` Peter Xu
2020-05-02 21:04     ` Peter Xu [this message]
2020-05-23 22:49       ` Peter Xu
