From: "Michael S. Tsirkin" <mst@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Christophe de Dinechin <dinechin@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Sean Christopherson <sean.j.christopherson@intel.com>,
	Yan Zhao <yan.y.zhao@intel.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	Kevin Kevin <kevin.tian@intel.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	Lei Cao <lei.cao@stratus.com>
Subject: Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
Date: Thu, 9 Jan 2020 17:18:24 -0500
Message-ID: <20200109171154-mutt-send-email-mst@kernel.org>
In-Reply-To: <20200109201916.GH36997@xz-x1>

On Thu, Jan 09, 2020 at 03:19:16PM -0500, Peter Xu wrote:
> > > while for virtio, both sides (hypervisor,
> > > and the guest driver) are trusted.
> > 
> > What gave you the impression guest is trusted in virtio?
> 
> Hmm... maybe when I know virtio can bypass vIOMMU as long as it
> doesn't provide IOMMU_PLATFORM flag? :)

If the guest driver does not provide IOMMU_PLATFORM and the device does,
then negotiation fails.
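
A minimal sketch of that rule, assuming the device treats IOMMU_PLATFORM
as mandatory once it has offered it. The function below is hypothetical and
is not taken from QEMU or the kernel; only the feature bit number comes from
the virtio specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bit 33 in the virtio spec (later renamed VIRTIO_F_ACCESS_PLATFORM). */
#define VIRTIO_F_IOMMU_PLATFORM	33

/*
 * Hypothetical device-side check: the driver may only accept features the
 * device offered, and if the device offered IOMMU_PLATFORM the driver must
 * accept it too.  Returns true when negotiation may proceed.
 */
static bool features_ok(uint64_t device_features, uint64_t driver_features)
{
	const uint64_t platform = 1ULL << VIRTIO_F_IOMMU_PLATFORM;

	if (driver_features & ~device_features)
		return false;	/* driver accepted a bit that was never offered */
	if ((device_features & platform) && !(driver_features & platform))
		return false;	/* device offered IOMMU_PLATFORM, driver declined */
	return true;
}
```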

> I think it's logical to trust a virtio guest kernel driver, could you
> guide me on what I've missed?


The guest driver is assumed to be part of the guest kernel. It can't
do anything the kernel can't do anyway.

> > 
> > 
> > >  Above means we need to do these to
> > > change to the new design:
> > > 
> > >   - Allow the GFN array to be mapped as writable by userspace (so that
> > >     userspace can publish bit 2),
> > > 
> > >   - The userspace must be trusted to follow the design (just imagine
> > >     what if the userspace overwrites a GFN when it publishes bit 2
> > >     over a valid dirty gfn entry?  KVM could wrongly unprotect a page
> > >     for the guest...).
> > 
> > You mean protect, right?  So what?
> 
> Yes, I mean with that, more things are uncertain from userspace.  It
> seems easier to me that we restrict the userspace with one index.

Dunno how to treat vague statements like this.  You need to be specific
about threat models. Otherwise there's no way to tell whether the code is
secure.

> > 
> > > While if we use the indices, we restrict the userspace to only be able
> > > to write to one index only (which is the reset_index).  That's all it
> > > can do to mess things up (and it could never as long as we properly
> > > validate the reset_index when read, which only happens during
> > > KVM_RESET_DIRTY_RINGS and is very rare).  From that pov, it seems the
> > > indices solution still has its benefits.
> > 
> > So if you mess up index how is this different?
> 
> We can't mess up much with that.  We simply check fetch_index (sorry I
> meant this when I said reset_index, anyway it's the only index that we
> expose to userspace) to make sure:
> 
>   reset_index <= fetch_index <= dirty_index
> 
> Otherwise we fail the ioctl.  With that, we're 100% safe.

Safe from what? Userspace can mess up guest memory trivially,
for example by skipping some memory or sending junk.
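
For reference, the bounds check quoted above (reset_index <= fetch_index <=
dirty_index) can be sketched in plain C as below. This is only an
illustration: the struct and function names are hypothetical, and it assumes
the indices are free-running 32-bit counters that may wrap, which the thread
does not spell out. It guards the ring's own bookkeeping and says nothing
about what userspace does with guest memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the ring's index state; not the patch's layout. */
struct ring_indices {
	uint32_t reset_index;	/* kernel-owned: entries already reset */
	uint32_t dirty_index;	/* kernel-owned: next entry to be produced */
};

/*
 * Return true if a userspace-supplied fetch index lies within
 * [reset_index, dirty_index].  Unsigned subtraction keeps the comparison
 * correct across 32-bit wrap-around, assuming fewer than 2^32 entries are
 * ever outstanding.
 */
static bool fetch_index_valid(const struct ring_indices *r, uint32_t fetch)
{
	uint32_t produced = r->dirty_index - r->reset_index;
	uint32_t fetched = fetch - r->reset_index;

	return fetched <= produced;
}
```

A caller modelled on the quoted description would fail the ioctl whenever
this returns false.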

> > 
> > I agree RO page kind of feels safer generally though.
> > 
> > I will have to re-read how the ring works though;
> > my comments were based on the old assumption of an mmapped
> > page with indices.
> 
> Yes, sorry again for a bad cover letter.
> 
> It's basically the same as before, just that we only have per-vcpu
> ring now, and the indices are exposed from kvm_run so we don't need
> the extra page, but we still expose that via mmap.

So that's why changelogs are useful.
Can you please write a changelog for this version so I don't
need to re-read all of it? Thanks!

> > 
> > 
> > 
> > > > 
> > > > 
> > > > 
> > > > >  The larger the ring buffer, the less
> > > > > +likely the ring is full and the VM is forced to exit to userspace. The
> > > > > +optimal size depends on the workload, but it is recommended that it be
> > > > > +at least 64 KiB (4096 entries).
> > > > 
> > > > Where's this number coming from? Given you have indices as well,
> > > > 4K size rings is likely to cause cache contention.
> > > 
> > > I think we've had some similar discussion in previous versions on the
> > > size of ring.  Again imho it's really something that may not have a
> > > direct clue as long as it's big enough (4K should be).
> > > 
> > > Regarding to the cache contention: could you explain more?
> > 
> > 4K is a whole cache way, and 64K is 16 ways.  If anything else is on a hot
> > path then you are pushing everything out of cache.  I need to re-read how
> > the indices work to see whether an index is on the hot path or not. If it
> > is, your structure won't fit in the L1 cache, which is not great.
> 
> I'm not sure whether I get the point correct, but logically we
> shouldn't read the whole ring buffer as a whole, but only partly (just
> like when we say the ring shouldn't even reach soft-full).  Even if we
> read the whole ring, I don't see a difference here comparing to when
> we read a huge array of data (e.g. "char buf[65536]") in any program
> that covers 64K range - I don't see a good way to fix this but read
> the whole chunk in.  It seems to be common in programs where we have
> big dataset.
> 
> [...]
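
The arithmetic behind the "4K is a whole cache way" remark can be worked
through as follows. The sketch assumes a 32 KiB, 8-way set-associative L1
data cache, which is a common configuration but is not stated anywhere in
the thread; the 16-byte entry size follows from the quoted documentation's
64 KiB for 4096 entries. All names are illustrative.

```c
#include <stddef.h>

/* Assumed L1D geometry (not from the thread): 32 KiB, 8-way. */
#define L1D_SIZE	(32u * 1024u)
#define L1D_WAYS	8u

/* From the quoted documentation: a 64 KiB ring holding 4096 entries. */
#define RING_BYTES	(64u * 1024u)
#define RING_ENTRIES	4096u

/* Bytes covered by one way: total cache size divided by associativity. */
static inline size_t way_size(size_t cache_size, size_t ways)
{
	return cache_size / ways;
}

/* Number of ways' worth of cache a buffer spans, rounded up. */
static inline size_t ways_spanned(size_t buf_bytes, size_t way_bytes)
{
	return (buf_bytes + way_bytes - 1) / way_bytes;
}
```

Under these assumptions one way is 4 KiB, so a fully traversed 64 KiB ring
spans 16 ways' worth of data, twice the 8 ways the cache has. Whether that
matters in practice depends on how much of the ring is touched on the hot
path.
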
> 
> > > > > +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > > > > +{
> > > > > +	u32 cur_slot, next_slot;
> > > > > +	u64 cur_offset, next_offset;
> > > > > +	unsigned long mask;
> > > > > +	u32 fetch;
> > > > > +	int count = 0;
> > > > > +	struct kvm_dirty_gfn *entry;
> > > > > +	struct kvm_dirty_ring_indices *indices = ring->indices;
> > > > > +	bool first_round = true;
> > > > > +
> > > > > +	fetch = READ_ONCE(indices->fetch_index);
> > > > 
> > > > So this does not work if the data cache is virtually tagged.
> > > > To the best of my knowledge that isn't the case on any
> > > > CPU kvm supports. However, that might not remain the
> > > > case forever. Worth at least commenting.
> > > 
> > > This is the read side.  IIUC even if with virtually tagged archs, we
> > > should do the flushing on the write side rather than the read side,
> > > and that should be enough?
> > 
> > No.
> > See e.g.  Documentation/core-api/cachetlb.rst
> > 
> >   ``void flush_dcache_page(struct page *page)``
> > 
> >         Any time the kernel writes to a page cache page, _OR_
> >         the kernel is about to read from a page cache page and
> >         user space shared/writable mappings of this page potentially
> >         exist, this routine is called.
> 
> But I don't understand why.  I feel like for such arch even the
> userspace must flush cache after publishing data onto shared memories,
> otherwise if the shared memory is between two userspace processes
> they'll get inconsistent state.  Then if with that, I'm confused on
> why the read side needs to flush it again.
> 
> > 
> > 
> > > Also, I believe this is the similar question that Jason has asked in
> > > V2.  Sorry I should mention this earlier, but I didn't address that in
> > > this series because if we need to do so we probably need to do it
> > > kvm-wise, rather than only in this series.
> > 
> > You need to document these things.
> > 
> > >  I feel like it's missing
> > > probably only because all existing KVM supported archs do not have
> > > virtual-tagged caches as you mentioned.
> > 
> > But is that a fact? ARM has such a variety of CPUs,
> > I can't really tell. Did you research this to make sure?
> 
> I didn't.  I only tried to find all callers of flush_dcache_page()
> through the whole Linux tree and I cannot see any kvm related code.
> To make this simple, let me address the dcache flushing issue in the
> next post.
> 
> Thanks,
> 
> -- 
> Peter Xu

