From: "Tian, Kevin" <kevin.tian@intel.com>
To: Peter Xu <peterx@redhat.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	"Christopherson, Sean J" <sean.j.christopherson@intel.com>,
	Christophe de Dinechin <dinechin@redhat.com>,
	"Zhao, Yan Y" <yan.y.zhao@intel.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Vitaly Kuznetsov" <vkuznets@redhat.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>
Subject: RE: [PATCH v8 00/14] KVM: Dirty ring interface
Date: Sun, 26 Apr 2020 10:29:51 +0000
Message-ID: <AADFC41AFE54684AB9EE6CBC0274A5D19D8CD4C8@SHSMSX104.ccr.corp.intel.com>
In-Reply-To: <20200424141944.GA41816@xz-x1>

> From: Peter Xu
> Sent: Friday, April 24, 2020 10:20 PM
> 
> On Fri, Apr 24, 2020 at 06:01:46AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, April 23, 2020 11:23 PM
> > >
> > > On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > > > > From: Peter Xu <peterx@redhat.com>
> > > > > Sent: Thursday, April 23, 2020 2:52 AM
> > > > >
> > > > > Hi,
> > > > >
> > > > > TL;DR: I'm thinking whether we should record the pure GPA/GFN instead
> > > > > of the (slot_id, slot_offset) tuple for dirty pages in the kvm dirty
> > > > > ring, to unbind kvm_dirty_gfn from memslots.
> > > > >
> > > > > (A slightly longer version starts...)
> > > > >
> > > > > The problem is that binding dirty tracking operations to KVM memslots
> > > > > is a restriction that needs synchronization with memslot changes,
> > > > > which further needs synchronization across all the vcpus because
> > > > > they're the consumers of memslots.  E.g., when we remove a memory
> > > > > slot, we need to flush all the dirty bits correctly before we do the
> > > > > removal of the memslot.  That's actually a known defect for QEMU/KVM
> > > > > [1] (I bet it could be a defect for many other hypervisors...) right
> > > > > now with current dirty logging.  Meanwhile, even if we fix it, that
> > > > > procedure does not scale at all, and is prone to deadlocks.
> > > > >
> > > > > Here memory removal is really a (still corner-cased but relatively)
> > > > > important scenario to think about for dirty logging compared to
> > > > > memory additions & moves, because memory addition always starts with
> > > > > no initial dirty pages, and we don't really move RAM a lot (or do we
> > > > > ever?!) for a general VM use case.
> > > > >
> > > > > Then I went a step back to think about why we need this dirty bit
> > > > > information at all if the memslot is going to be removed.
> > > > >
> > > > > There are two cases:
> > > > >
> > > > >   - When the memslot is going to be removed forever, then the dirty
> > > > >     information is indeed meaningless and can be dropped, and,
> > > > >
> > > > >   - When the memslot is going to be removed but quickly added back
> > > > >     with a changed size, then we need to keep those dirty bits,
> > > > >     because it's just a common way to e.g. punch an MMIO hole in an
> > > > >     existing RAM region (here I'd confess I feel like using "slot_id"
> > > > >     to identify a memslot is really unfriendly syscall design for
> > > > >     things like "hole punching" in the RAM address space...  However,
> > > > >     such a "punch hole" operation is really needed even for a common
> > > > >     guest, for either system reboots or device hotplugs, etc.).
> > > >
> > > > why would device hotplug punch a hole in an existing RAM region?
> > >
> > > I thought it could happen because I used to trace the KVM ioctls and see
> > > the memslot changes during driver loading.  But later when I tried to
> > > hotplug a
> >
> > Is there more detail on why driver loading may lead to memslot changes?
> 
> E.g., I can observe these after Linux loads and before the prompt, on the
> simplest VM with default devices:
> 
> 41874@1587736345.192636:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.192760:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.193884:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.193956:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.195788:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.195838:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.196769:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.196827:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.197787:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.197832:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.198777:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.198836:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.200491:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.200537:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.201592:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.201649:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.202415:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.202461:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.203169:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.203225:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.204037:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.204083:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.204983:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.205041:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.205940:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.206022:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.206981:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.207038:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41875@1587736351.141052:kvm_set_user_memory Slot#9 flags=0x1 gpa=0xa0000 size=0x10000 ua=0x7fadf6800000 ret=0
> 
> After a careful look, I noticed it's mostly the VGA device turning slot 3 off
> & on.  Frankly speaking, I don't know why it happens to do so.
> 
> >
> > > device I do see that it won't...  The new MMIO regions are added only
> > > into 0xfe000000 for a virtio-net:
> > >
> > >   00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
> > >   00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
> > >   00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
> > >   00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
> > >   00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
> > >   00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba
> > >
> > > Does it mean that device plugging is guaranteed to not trigger RAM
> > > changes?
> >
> > I'd think so. Otherwise, from the guest's p.o.v., any device hotplug would
> > imply doing a memory hot-unplug first, which would be a bad design.
> 
> Right that's what I was confused about.  Then maybe you're right. :)
> 
> >
> > > I am really curious about what cases we need to consider in which we
> > > need to keep the dirty bits for a memory removal, and if system reset is
> > > the only case, then it could be even easier (because we might be able to
> > > avoid the sync in memory removal but do that once in a sys reset hook)...
> >
> > Possibly memory hot-unplug, as allowed by recent virtio-mem?
> 
> That should belong to the case where dirty bits do not matter at all after the
> removal, right?  I would be mostly curious about when we (1) remove a memory
> slot, and at the same time (2) still care about the dirty bits of that slot.

I remember one case. The PCIe spec defines a resizable BAR capability, allowing
hardware to communicate its supported resource sizes and software to set an
optimal size back to the hardware. If such a capability is presented to the
guest and the BAR is backed by memory, it's possible to observe a
removal-and-add-back scenario. However, in that case the spec requires software
to clear the memory space enable bit in the command register before changing
the BAR size. I suppose such a thing should happen in the boot phase, where
dirty bits related to the old BAR size don't matter.
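
Roughly, the ordering the spec asks for looks like the sketch below (just my
illustration using Linux pci_regs.h names, not code from any real driver; the
per-BAR control-entry lookup and all error handling are omitted):

  #include <linux/pci.h>

  /* Illustrative only: resize one BAR following the resizable BAR ordering. */
  static void rebar_resize_sketch(struct pci_dev *dev, u32 new_size_enc)
  {
          int pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_REBAR);
          u16 cmd;
          u32 ctrl;

          /* 1. Clear the memory space enable bit before touching the BAR. */
          pci_read_config_word(dev, PCI_COMMAND, &cmd);
          pci_write_config_word(dev, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);

          /* 2. Program the new size into the resizable BAR control register
           *    (assuming the first control entry, for simplicity). */
          pci_read_config_dword(dev, pos + PCI_REBAR_CTRL, &ctrl);
          ctrl &= ~PCI_REBAR_CTRL_BAR_SIZE;
          ctrl |= new_size_enc << PCI_REBAR_CTRL_BAR_SHIFT;
          pci_write_config_dword(dev, pos + PCI_REBAR_CTRL, ctrl);

          /* 3. Re-assign the BAR as needed, then re-enable memory decode. */
          pci_write_config_word(dev, PCI_COMMAND, cmd);
  }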

> 
> I'll see whether I can remove the dirty bit sync in kvm_set_phys_mem(), which
> I think is really nasty.
> 
> >
> > btw VFIO faces a similar problem when unmapping a DMA range (e.g. when
> > vIOMMU is enabled) in the dirty log phase. There could be some dirty bits
> > which are not retrieved when the unmapping happens. VFIO chooses to return
> > the dirty bits in a buffer passed in the unmapping parameters. Can the
> > memslot interface do a similar thing by allowing userspace to specify a
> > buffer pointer to hold whatever dirty pages are recorded for the slot that
> > is being removed?
> 
> Yes I think we can, but it may not be necessary.  Actually, IMHO, CPU accesses
> to pages are slightly different from device DMAs in that we can do this
> sequence to collect the dirty bits of a slot safely:
> 
>   - mark the slot as READONLY
>   - KVM_GET_DIRTY_LOG on the slot
>   - finally remove the slot
> 
> I guess VFIO cannot do that because there's no way to really "mark the region
> as read-only" for a device: DMA could still happen, and the DMA would then
> fail when writing to a readonly slot.
> 
> On the KVM/CPU side, after we mark the slot as READONLY, CPU writes will page
> fault and fall back to QEMU userspace, and then QEMU will take care of the
> writes (so those writes could be slow but still work).  So even if we mark the
> slot READONLY it won't fail the writes; they just fall back to QEMU.

Yes, you are right.
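
Just to make sure I read that sequence the same way, something like the sketch
below (purely illustrative: the helper name and slot geometry are made up and
error handling is omitted; also note that today's KVM rejects toggling
KVM_MEM_READONLY on an existing slot, since flags-only updates may only flip
dirty logging, so this shows the conceptual ordering rather than something the
current API accepts as-is):

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Drain the remaining dirty bits of a slot, then delete it. */
  static void drain_then_remove_slot(int vm_fd,
                                     struct kvm_userspace_memory_region *region,
                                     void *dirty_bitmap)
  {
          struct kvm_dirty_log log = {
                  .slot = region->slot,
                  .dirty_bitmap = dirty_bitmap,
          };

          /* 1. Mark the slot read-only so further CPU writes exit to
           *    userspace instead of dirtying pages behind our back. */
          region->flags |= KVM_MEM_READONLY;
          ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, region);

          /* 2. Harvest whatever is still marked dirty for this slot. */
          ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);

          /* 3. Finally remove the slot (memory_size == 0 deletes it). */
          region->memory_size = 0;
          ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, region);
  }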

> 
> Btw, since we're discussing VFIO dirty logging across memory removal... the
> unmapped DMA range you're talking about needs to be added back later, right?

Pure memory removal doesn't need an add-back. Or are you specifically
referring to the removal-and-add-back case?

> Then what if the device does DMA after the removal but before it's added
> back?

Then it just gets an IOMMU page fault, but I guess I didn't get your real
question...

> I think this is a more general question, not only for dirty logging but also
> for when dirty logging is not enabled - I never understood how this could be
> fixed with existing facilities.
> 

Thanks,
Kevin
