From: "Michael S. Tsirkin" <mst@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Christophe de Dinechin <dinechin@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Sean Christopherson <sean.j.christopherson@intel.com>,
Yan Zhao <yan.y.zhao@intel.com>,
Alex Williamson <alex.williamson@redhat.com>,
Jason Wang <jasowang@redhat.com>,
Kevin Kevin <kevin.tian@intel.com>,
Vitaly Kuznetsov <vkuznets@redhat.com>,
"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
Lei Cao <lei.cao@stratus.com>
Subject: Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
Date: Thu, 9 Jan 2020 14:35:46 -0500 [thread overview]
Message-ID: <20200109141634-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <20200109191514.GD36997@xz-x1>
On Thu, Jan 09, 2020 at 02:15:14PM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 11:29:28AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> > > This patch is heavily based on previous work from Lei Cao
> > > <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> > >
> > > KVM currently uses large bitmaps to track dirty memory. These bitmaps
> > > are copied to userspace when userspace queries KVM for its dirty page
> > > information. The use of bitmaps is mostly sufficient for live
> > > migration, as large parts of memory are be dirtied from one log-dirty
> > > pass to another. However, in a checkpointing system, the number of
> > > dirty pages is small and in fact it is often bounded---the VM is
> > > paused when it has dirtied a pre-defined number of pages. Traversing a
> > > large, sparsely populated bitmap to find set bits is time-consuming,
> > > as is copying the bitmap to user-space.
> > >
> > > A similar issue will be there for live migration when the guest memory
> > > is huge while the page dirty procedure is trivial. In that case for
> > > each dirty sync we need to pull the whole dirty bitmap to userspace
> > > and analyse every bit even if it's mostly zeros.
> > >
> > > The preferred data structure for above scenarios is a dense list of
> > > guest frame numbers (GFN).
> >
> > No longer, this uses an array of structs.
>
> (IMHO it's more or less a wording thing, because it's still an array
> of GFNs behind it...)
>
> [...]
>
> > > +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
> > > +For each of the dirty entry it's defined as:
> > > +
> > > +struct kvm_dirty_gfn {
> > > + __u32 pad;
> >
> > How about sticking a length here?
> > This way huge pages can be dirtied in one go.
>
> As we've discussed previously, current KVM tracks dirty in 4K page
> only, so it seems to be something that is not easily covered in this
> series.
>
> We probably need to justify on having KVM to track huge pages first,
> or at least a trend that we're going to do that, then we can properly
> reserve it here.
>
> >
> > > + __u32 slot; /* as_id | slot_id */
> > > + __u64 offset;
> > > +};
> > > +
> > > +Most of the ring structure is used by KVM internally, while only the
> > > +indices are exposed to userspace:
> > > +
> > > +struct kvm_dirty_ring_indices {
> > > + __u32 avail_index; /* set by kernel */
> > > + __u32 fetch_index; /* set by userspace */
> > > +};
> > > +
> > > +The two indices in the ring buffer are free running counters.
> > > +
> > > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> > > +to enable this capability for the new guest and set the size of the
> > > +rings. It is only allowed before creating any vCPU, and the size of
> > > +the ring must be a power of two.
> >
> >
> > I know index design is popular, but testing with virtio showed
> > that it's better to just have a flags field marking
> > an entry as valid. In particular this gets rid of the
> > running counters and power of two limitations.
> > It also removes the need for a separate index page, which is nice.
>
> Firstly, note that the separate index page has already been dropped
> since V2, so we don't need to worry on that.
changelog would be nice.
So now, how does userspace tell kvm it's done with the ring?
> Regarding dropping the indices: I feel like it can be done, though we
> probably need two extra bits for each GFN entry, for example:
>
> - Bit 0 of the GFN address to show whether this is a valid publish
> of dirty gfn
>
> - Bit 1 of the GFN address to show whether this is collected by the
> user
I wonder whether you will end up reinventing virtio.
You are already pretty close with avail/used bits in flags.
> We can also use the padding field, but just want to show the idea
> first.
>
> Then for each GFN we can go through state changes like this (things
> like "00b" stands for "bit1 bit0" values):
>
> 00b (invalid GFN) ->
> 01b (valid gfn published by kernel, which is dirty) ->
> 10b (gfn dirty page collected by userspace) ->
> 00b (gfn reset by kernel, so goes back to invalid gfn)
>
> And we should always guarantee that both the userspace and KVM walks
> the GFN array in a linear manner, for example, KVM must publish a new
> GFN with bit 1 set right after the previous publish of GFN. Vice
> versa to the userspace when it collects the dirty GFN and mark bit 2.
>
> Michael, do you mean something like this?
>
> I think it should work logically, however IIUC it can expose more
> security risks, say, dirty ring is different from virtio in that
> userspace is not trusted,
In what sense?
> while for virtio, both sides (hypervisor,
> and the guest driver) are trusted.
What gave you the impression guest is trusted in virtio?
> Above means we need to do these to
> change to the new design:
>
> - Allow the GFN array to be mapped as writable by userspace (so that
> userspace can publish bit 2),
>
> - The userspace must be trusted to follow the design (just imagine
> what if the userspace overwrites a GFN when it publishes bit 2
> over a valid dirty gfn entry? KVM could wrongly unprotect a page
> for the guest...).
You mean protect, right? So what?
> While if we use the indices, we restrict the userspace to only be able
> to write to one index only (which is the reset_index). That's all it
> can do to mess things up (and it could never as long as we properly
> validate the reset_index when read, which only happens during
> KVM_RESET_DIRTY_RINGS and is very rare). From that pov, it seems the
> indices solution still has its benefits.
So if you mess up index how is this different?
I agree RO page kind of feels safer generally though.
I will have to re-read how does the ring works though,
my comments were based on the old assumption of mmaped
page with indices.
> >
> >
> >
> > > The larger the ring buffer, the less
> > > +likely the ring is full and the VM is forced to exit to userspace. The
> > > +optimal size depends on the workload, but it is recommended that it be
> > > +at least 64 KiB (4096 entries).
> >
> > Where's this number coming from? Given you have indices as well,
> > 4K size rings is likely to cause cache contention.
>
> I think we've had some similar discussion in previous versions on the
> size of ring. Again imho it's really something that may not have a
> direct clue as long as it's big enough (4K should be).
>
> Regarding to the cache contention: could you explain more?
4K is a whole cache way. 64K 16 ways. If there's anything else is a hot
path then you are pushing everything out of cache. To re-read how do
indices work so see whether an index is on hot path or not. If yes your
structure won't fit in L1 cache which is not great.
> Do you
> have a suggestion on the size of ring instead considering the issue?
>
> [...]
I'll have to re-learn how do things work with indices gone
from shared memory.
> > > +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > > +{
> > > + u32 cur_slot, next_slot;
> > > + u64 cur_offset, next_offset;
> > > + unsigned long mask;
> > > + u32 fetch;
> > > + int count = 0;
> > > + struct kvm_dirty_gfn *entry;
> > > + struct kvm_dirty_ring_indices *indices = ring->indices;
> > > + bool first_round = true;
> > > +
> > > + fetch = READ_ONCE(indices->fetch_index);
> >
> > So this does not work if the data cache is virtually tagged.
> > Which to the best of my knowledge isn't the case on any
> > CPU kvm supports. However it might not stay being the
> > case forever. Worth at least commenting.
>
> This is the read side. IIUC even if with virtually tagged archs, we
> should do the flushing on the write side rather than the read side,
> and that should be enough?
No.
See e.g. Documentation/core-api/cachetlb.rst
``void flush_dcache_page(struct page *page)``
Any time the kernel writes to a page cache page, _OR_
the kernel is about to read from a page cache page and
user space shared/writable mappings of this page potentially
exist, this routine is called.
> Also, I believe this is the similar question that Jason has asked in
> V2. Sorry I should mention this earlier, but I didn't address that in
> this series because if we need to do so we probably need to do it
> kvm-wise, rather than only in this series.
You need to document these things.
> I feel like it's missing
> probably only because all existing KVM supported archs do not have
> virtual-tagged caches as you mentioned.
But is that a fact? ARM has such a variety of CPUs,
I can't really tell. Did you research this to make sure?
> If so, I would prefer if you
> can allow me to ignore that issue until KVM starts to support such an
> arch.
Document limitations pls. Don't ignore them.
> >
> >
> > > +
> > > + /*
> > > + * Note that fetch_index is written by the userspace, which
> > > + * should not be trusted. If this happens, then it's probably
> > > + * that the userspace has written a wrong fetch_index.
> > > + */
> > > + if (fetch - ring->reset_index > ring->size)
> > > + return -EINVAL;
> > > +
> > > + if (fetch == ring->reset_index)
> > > + return 0;
> > > +
> > > + /* This is only needed to make compilers happy */
> > > + cur_slot = cur_offset = mask = 0;
> > > + while (ring->reset_index != fetch) {
> > > + entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > > + next_slot = READ_ONCE(entry->slot);
> > > + next_offset = READ_ONCE(entry->offset);
> >
> > What is this READ_ONCE doing? Entries are only written by kernel
> > and it's under lock.
>
> The entries are written in kvm_dirty_ring_push() where there should
> have no lock (there's one wmb() though to guarantee ordering of these
> and the index update).
>
> With the wmb(), the write side should guarantee to make it to memory.
> For the read side here, I think it's still good to have it to make
> sure we read from memory?
>
> >
> > > + ring->reset_index++;
> > > + count++;
> > > + /*
> > > + * Try to coalesce the reset operations when the guest is
> > > + * scanning pages in the same slot.
> > > + */
> > > + if (!first_round && next_slot == cur_slot) {
> > > + s64 delta = next_offset - cur_offset;
> > > +
> > > + if (delta >= 0 && delta < BITS_PER_LONG) {
> > > + mask |= 1ull << delta;
> > > + continue;
> > > + }
> > > +
> > > + /* Backwards visit, careful about overflows! */
> > > + if (delta > -BITS_PER_LONG && delta < 0 &&
> > > + (mask << -delta >> -delta) == mask) {
> > > + cur_offset = next_offset;
> > > + mask = (mask << -delta) | 1;
> > > + continue;
> > > + }
> > > + }
> >
> > Well how important is this logic? Because it will not be
> > too effective on an SMP system, so don't you need a per-cpu ring?
>
> It's my fault to have omit the high-level design in the cover letter,
> but we do have per-vcpu ring now. Actually that's what we only have
> (we dropped the per-vm ring already) so ring access does not need lock
> any more.
>
> This logic is good because kvm_reset_dirty_gfn, especially inside that
> there's kvm_arch_mmu_enable_log_dirty_pt_masked() that supports masks,
> so it would be good to do the reset for continuous pages (or page
> that's close enough) in a single shot.
>
> >
> >
> >
> > > + kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > + cur_slot = next_slot;
> > > + cur_offset = next_offset;
> > > + mask = 1;
> > > + first_round = false;
> > > + }
> > > + kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +
> > > + trace_kvm_dirty_ring_reset(ring);
> > > +
> > > + return count;
> > > +}
> > > +
> > > +void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> > > +{
> > > + struct kvm_dirty_gfn *entry;
> > > + struct kvm_dirty_ring_indices *indices = ring->indices;
> > > +
> > > + /* It should never get full */
> > > + WARN_ON_ONCE(kvm_dirty_ring_full(ring));
> > > +
> > > + entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > > + entry->slot = slot;
> > > + entry->offset = offset;
> > > + /*
> > > + * Make sure the data is filled in before we publish this to
> > > + * the userspace program. There's no paired kernel-side reader.
> > > + */
> > > + smp_wmb();
> > > + ring->dirty_index++;
> >
> >
> > Do I understand it correctly that the ring is shared between CPUs?
> > If so I don't understand why it's safe for SMP guests.
> > Don't you need atomics or locking?
>
> No, it's per-vcpu.
>
> Thanks,
>
> --
> Peter Xu
next prev parent reply other threads:[~2020-01-09 19:36 UTC|newest]
Thread overview: 82+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
2020-01-09 14:57 ` [PATCH v3 01/21] vfio: introduce vfio_iova_rw to read/write a range of IOVAs Peter Xu
2020-01-09 14:57 ` [PATCH v3 02/21] drm/i915/gvt: subsitute kvm_read/write_guest with vfio_iova_rw Peter Xu
2020-01-09 14:57 ` [PATCH v3 03/21] KVM: Remove kvm_read_guest_atomic() Peter Xu
2020-01-09 14:57 ` [PATCH v3 04/21] KVM: Add build-time error check on kvm_run size Peter Xu
2020-01-09 14:57 ` [PATCH v3 05/21] KVM: X86: Change parameter for fast_page_fault tracepoint Peter Xu
2020-01-09 14:57 ` [PATCH v3 06/21] KVM: X86: Don't take srcu lock in init_rmode_identity_map() Peter Xu
2020-01-09 14:57 ` [PATCH v3 07/21] KVM: Cache as_id in kvm_memory_slot Peter Xu
2020-01-09 14:57 ` [PATCH v3 08/21] KVM: X86: Drop x86_set_memory_region() Peter Xu
2020-01-09 14:57 ` [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] Peter Xu
2020-01-19 9:01 ` Paolo Bonzini
2020-01-20 6:45 ` Peter Xu
2020-01-21 15:56 ` Sean Christopherson
2020-01-21 16:14 ` Paolo Bonzini
2020-01-28 5:50 ` Peter Xu
2020-01-28 18:24 ` Sean Christopherson
2020-01-31 15:08 ` Peter Xu
2020-01-31 19:33 ` Sean Christopherson
2020-01-31 20:28 ` Peter Xu
2020-01-31 20:36 ` Sean Christopherson
2020-01-31 20:55 ` Peter Xu
2020-01-31 21:29 ` Sean Christopherson
2020-01-31 22:16 ` Peter Xu
2020-01-31 22:20 ` Sean Christopherson
2020-01-09 14:57 ` [PATCH v3 10/21] KVM: Pass in kvm pointer into mark_page_dirty_in_slot() Peter Xu
2020-01-09 14:57 ` [PATCH v3 11/21] KVM: Move running VCPU from ARM to common code Peter Xu
2020-01-09 14:57 ` [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
2020-01-09 16:29 ` Michael S. Tsirkin
2020-01-09 16:56 ` Alex Williamson
2020-01-09 19:21 ` Peter Xu
2020-01-09 19:36 ` Michael S. Tsirkin
2020-01-09 19:15 ` Peter Xu
2020-01-09 19:35 ` Michael S. Tsirkin [this message]
2020-01-09 20:19 ` Peter Xu
2020-01-09 22:18 ` Michael S. Tsirkin
2020-01-10 15:29 ` Peter Xu
2020-01-12 6:24 ` Michael S. Tsirkin
2020-01-14 20:01 ` Peter Xu
2020-01-15 6:50 ` Michael S. Tsirkin
2020-01-15 15:20 ` Peter Xu
2020-01-19 9:09 ` Paolo Bonzini
2020-01-19 10:12 ` Michael S. Tsirkin
2020-01-20 7:29 ` Peter Xu
2020-01-20 7:47 ` Michael S. Tsirkin
2020-01-21 8:29 ` Peter Xu
2020-01-21 10:25 ` Paolo Bonzini
2020-01-21 10:24 ` Paolo Bonzini
2020-01-11 4:49 ` kbuild test robot
2020-01-11 23:19 ` kbuild test robot
2020-01-15 6:47 ` Michael S. Tsirkin
2020-01-15 15:27 ` Peter Xu
2020-01-16 8:38 ` Michael S. Tsirkin
2020-01-16 16:27 ` Peter Xu
2020-01-17 9:50 ` Michael S. Tsirkin
2020-01-20 6:48 ` Peter Xu
2020-01-09 14:57 ` [PATCH v3 13/21] KVM: Make dirty ring exclusive to dirty bitmap log Peter Xu
2020-01-09 14:57 ` [PATCH v3 14/21] KVM: Don't allocate dirty bitmap if dirty ring is enabled Peter Xu
2020-01-09 16:41 ` Peter Xu
2020-01-09 14:57 ` [PATCH v3 15/21] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
2020-01-09 14:57 ` [PATCH v3 16/21] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
2020-01-09 14:57 ` [PATCH v3 17/21] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
2020-01-09 14:57 ` [PATCH v3 18/21] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
2020-01-09 14:57 ` [PATCH v3 19/21] KVM: selftests: Add dirty ring buffer test Peter Xu
2020-01-09 14:57 ` [PATCH v3 20/21] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
2020-01-09 14:57 ` [PATCH v3 21/21] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
2020-01-09 15:59 ` [PATCH v3 00/21] KVM: Dirty ring interface Michael S. Tsirkin
2020-01-09 16:17 ` Peter Xu
2020-01-09 16:40 ` Michael S. Tsirkin
2020-01-09 17:08 ` Peter Xu
2020-01-09 19:08 ` Michael S. Tsirkin
2020-01-09 19:39 ` Peter Xu
2020-01-09 20:42 ` Paolo Bonzini
2020-01-09 22:28 ` Michael S. Tsirkin
2020-01-10 15:10 ` Peter Xu
2020-01-09 16:47 ` Alex Williamson
2020-01-09 17:58 ` Peter Xu
2020-01-09 19:13 ` Michael S. Tsirkin
2020-01-09 19:23 ` Peter Xu
2020-01-09 19:37 ` Michael S. Tsirkin
2020-01-09 20:51 ` Paolo Bonzini
2020-01-09 22:21 ` Michael S. Tsirkin
2020-01-19 9:11 ` Paolo Bonzini
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200109141634-mutt-send-email-mst@kernel.org \
--to=mst@redhat.com \
--cc=alex.williamson@redhat.com \
--cc=dgilbert@redhat.com \
--cc=dinechin@redhat.com \
--cc=jasowang@redhat.com \
--cc=kevin.tian@intel.com \
--cc=kvm@vger.kernel.org \
--cc=lei.cao@stratus.com \
--cc=linux-kernel@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=peterx@redhat.com \
--cc=sean.j.christopherson@intel.com \
--cc=vkuznets@redhat.com \
--cc=yan.y.zhao@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).