From: Sean Christopherson <sean.j.christopherson@intel.com>
To: Peter Xu <peterx@redhat.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Christophe de Dinechin <dinechin@redhat.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Yan Zhao <yan.y.zhao@intel.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	Kevin Tian <kevin.tian@intel.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
Date: Fri, 31 Jan 2020 11:33:01 -0800
Message-ID: <20200131193301.GC18946@linux.intel.com>
In-Reply-To: <20200131150832.GA740148@xz-x1>

On Fri, Jan 31, 2020 at 10:08:32AM -0500, Peter Xu wrote:
> On Tue, Jan 28, 2020 at 10:24:03AM -0800, Sean Christopherson wrote:
> > On Tue, Jan 28, 2020 at 01:50:05PM +0800, Peter Xu wrote:
> > > On Tue, Jan 21, 2020 at 07:56:57AM -0800, Sean Christopherson wrote:
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index c4d3972dcd14..ff97782b3919 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -9584,7 +9584,15 @@ void kvm_arch_sync_events(struct kvm *kvm)
> > > > >  	kvm_free_pit(kvm);
> > > > >  }
> > > > >  
> > > > > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > > > > +/*
> > > > > + * If `uaddr' is specified, `*uaddr' will be returned with the
> > > > > + * userspace address that was just allocated.  `uaddr' is only
> > > > > + * meaningful if the function returns zero, and `uaddr' will only be
> > > > > + * valid when with either the slots_lock or with the SRCU read lock
> > > > > + * held.  After we release the lock, the returned `uaddr' will be invalid.
> > > > 
> > > > This is all incorrect.  Neither of those locks has any bearing on the
> > > > validity of the hva.  slots_lock does as the name suggests and prevents
> > > > concurrent writes to the memslots.  The SRCU lock ensures the implicit
> > > > memslots lookup in kvm_clear_guest_page() won't result in a use-after-free
> > > > due to dereferencing old memslots.
> > > > 
> > > > Neither of those has anything to do with the userspace address, they're
> > > > both fully tied to KVM's gfn->hva lookup.  As Paolo pointed out, KVM's
> > > > mapping is instead tied to the lifecycle of the VM.  Note, even *that* has
> > > > no bearing on the validity of the mapping or address as KVM only increments
> > > > mm_count, not mm_users, i.e. guarantees the mm struct itself won't be freed
> > > > but doesn't ensure the vmas or associated pages tables are valid.
> > > > 
> > > > Which is the entire point of using __copy_{to,from}_user(), as they
> > > > gracefully handle the scenario where the process has no valid mapping
> > > > and/or translation for the address.
> > > 
> > > Sorry I don't understand.
> > > 
> > > I do think either the slots_lock or SRCU would protect at least the
> > > existing kvm->memslots, and if so at least the previous vm_mmap()
> > > return value should still be valid.
> > 
> > Nope.  kvm->slots_lock only protects gfn->hva lookups, e.g. userspace can
> > munmap() the range at any time.
> 
> Do we need to consider that?  If userspace does this then it'll
> corrupt itself, and IMHO a private memory slot is nothing special
> here compared to the user memory slots.  For example, userspace can
> unmap any region after the KVM_SET_USER_MEMORY_REGION ioctl, even if
> the region backs the userspace_addr of some
> kvm_userspace_memory_region, so a cached userspace_addr can become
> invalid and kvm_write_guest_page() can fail for the same reason.
> IMHO KVM only needs to make sure it handles the failure path, and
> then it's perfectly fine.

Yes?  No?  My point is that your original comment's assertion that "'uaddr'
will only be valid when with either the slots_lock or with the SRCU read
lock held." is wrong and misleading.

> > > I agree that __copy_to_user() will protect us in many cases from the
> > > process mm PoV (it allows page faults inside), but again if kvm->memslots
> > > is changed underneath us then it's another story, IMHO, and that's why we
> > > need either the lock or SRCU.
> > 
> > No, again, slots_lock and SRCU only protect gfn->hva lookups.
> 
> Yes, then could you further explain why you think we don't need the
> slot lock?

For the same reason we don't take mmap_sem, it gains us nothing, i.e. KVM
still has to use copy_{to,from}_user().
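
(Illustrative sketch, not from the original mail: the only thing any
caller can do with the hva is check the uaccess return value.  If
userspace munmap()s the range, the copy fails gracefully; the function
name below is made up.)

/*
 * Writing through a cached hva from a private memslot.
 * __copy_to_user() returns the number of bytes that could NOT be
 * copied, so a stale/unmapped hva yields -EFAULT instead of a crash.
 * No KVM lock can make the hva "more valid".
 */
static int write_via_cached_hva(void __user *hva, const void *data,
				unsigned long len)
{
	if (__copy_to_user(hva, data, len))
		return -EFAULT;
	return 0;
}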

In the proposed __x86_set_memory_region() refactor, vmx_set_tss_addr()
would be provided the hva of the memory region.  Since slots_lock and SRCU
only protect gfn->hva, why would KVM take slots_lock since it already has
the hva?

> > > Or are you assuming that (1) __x86_set_memory_region() is only for the
> > > 3 private kvm memslots, 
> > 
> > It's not an assumption, the entire purpose of __x86_set_memory_region()
> > is to provide support for private KVM memslots.
> > 
> > > and (2) currently the kvm private memory slots will never change after VM
> > > is created and before VM is destroyed?
> > 
> > No, I'm not assuming the private memslots are constant, e.g. the flow in
> > question, vmx_set_tss_addr(), is directly tied to an unprotected ioctl().
> 
> Why is it unprotected?

Because it doesn't need to be protected.

> Now vmx_set_tss_addr() is protected by the slots lock so concurrent operation
> is safe; also it'll return -EEXIST if called more than once.

Returning -EEXIST is an ABI change, e.g. userspace can currently call
KVM_SET_TSS_ADDR any number of times; it just needs to ensure proper
serialization between calls.
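
(Illustrative userspace sketch, not from the original mail: under the
existing ABI repeated calls are legal so long as userspace serializes
them itself.  Assume vm_fd is an open KVM VM fd.)

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * KVM_SET_TSS_ADDR may be issued any number of times today; KVM only
 * guarantees it won't crash, not that the guest stays coherent.
 */
static int set_tss(int vm_fd, unsigned long addr)
{
	return ioctl(vm_fd, KVM_SET_TSS_ADDR, addr);
}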

If you want to change the ABI, then submit a patch to do exactly that.
But don't bury an ABI change under the pretense that it's a bug fix.

> [1]
> 
> > 
> > KVM's sole responsibility for vmx_set_tss_addr() is to not crash the kernel.
> > Userspace is responsible for ensuring it doesn't break its guests, e.g.
> > that multiple calls to KVM_SET_TSS_ADDR are properly serialized.
> > 
> > In the existing code, KVM ensures it doesn't crash by holding the SRCU lock
> > for the duration of init_rmode_tss() so that the gfn->hva lookups in
> > kvm_clear_guest_page() don't dereference a stale memslots array.
> 
> Here in the current master branch we have both the RCU lock and the
> slot lock held, that's why I think we can safely remove the RCU lock
> as long as we're still holding the slots lock.  We can't do the
> reverse because otherwise multiple KVM_SET_TSS_ADDR could race.

Your wording is all messed up.  "we have both the RCU lock and the slot
lock held" is wrong.  KVM holds slots_lock around __x86_set_memory_region(),
because changing the memslots must be mutually exclusive.  It then *drops*
slots_lock because it's done writing the memslots, and grabs the SRCU lock
in order to protect the gfn->hva lookups done by init_rmode_tss().  It
*intentionally* drops slots_lock because the writes done by init_rmode_tss()
do not need to be mutually exclusive, per KVM's existing ABI.

If KVM held both slots_lock and SRCU then __x86_set_memory_region() would
deadlock on synchronize_srcu().
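
(For reference, a condensed sketch of the current flow, simplified from
the existing vmx.c rather than quoted verbatim; the unrestricted-guest
check and some error handling are omitted.)

static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
{
	int ret;

	mutex_lock(&kvm->slots_lock);
	ret = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
				      PAGE_SIZE * 3);
	mutex_unlock(&kvm->slots_lock);	/* done writing memslots */
	if (ret)
		return ret;

	to_kvm_vmx(kvm)->tss_addr = addr;

	/*
	 * init_rmode_tss() grabs srcu_read_lock(&kvm->srcu) so that the
	 * gfn->hva lookups in kvm_clear_guest_page() don't dereference
	 * freed memslots; holding SRCU across __x86_set_memory_region()
	 * instead would deadlock on synchronize_srcu().
	 */
	return init_rmode_tss(kvm);
}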

> > In no way
> > does that ensure the validity of the resulting hva,
> 
> Yes, but as I mentioned, I don't think it's an issue to be considered
> by KVM; otherwise we'd have the same issue all over the place
> when we fetch the cached userspace_addr from any user slots.

Huh?  Of course it's an issue that needs to be considered by KVM, e.g.
kvm_{read,write}_guest_cached() aren't using __copy_{to,from}_user() for
giggles.

> > e.g. multiple calls to
> > KVM_SET_TSS_ADDR would race to set vmx->tss_addr and so init_rmode_tss()
> > could be operating on a stale gpa.
> 
> Please refer to [1].
> 
> I just want to double-check what we're discussing now.  Are you
> sure you're suggesting that we should remove the slot lock in
> init_rmode_tss()?  I'm asking because you discussed quite a bit how the
> slot lock should protect GPA->HVA, concurrency and so on, so now
> I'm even more confused...

Yes, if init_rmode_tss() is provided the hva then it does not need to
grab srcu_read_lock(&kvm->srcu) because it can directly call
__copy_{to,from}_user() instead of bouncing through the KVM helpers that
translate a gfn to hva.

The code can look like this.  That being said, I've completely lost track
of why __x86_set_memory_region() needs to provide the hva, i.e. I have no
idea whether we *should* do this, or whether it would be better to keep the
current code, which would be slower, but less custom.

static int init_rmode_tss(void __user *hva)
{
	const void *zero_page = (const void *)__va(page_to_phys(ZERO_PAGE(0)));
	u16 data = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE;
	int r;

	/* Zero the first page of the TSS. */
	r = __copy_to_user(hva, zero_page, PAGE_SIZE);
	if (r)
		return -EFAULT;

	/* Set the IO permission bitmap base within the TSS. */
	r = __copy_to_user(hva + TSS_IOPB_BASE_OFFSET, &data, sizeof(u16));
	if (r)
		return -EFAULT;

	/* Zero the second and third pages. */
	hva += PAGE_SIZE;
	r = __copy_to_user(hva, zero_page, PAGE_SIZE);
	if (r)
		return -EFAULT;

	hva += PAGE_SIZE;
	r = __copy_to_user(hva, zero_page, PAGE_SIZE);
	if (r)
		return -EFAULT;

	/* Terminate the IO bitmap: one 0xff byte at the very end. */
	data = ~0;
	r = __copy_to_user(hva + RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1, &data,
			   sizeof(u8));
	if (r)
		return -EFAULT;

	return 0;
}

static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
{
	void __user *hva;

	if (enable_unrestricted_guest)
		return 0;

	mutex_lock(&kvm->slots_lock);
	hva = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
				      PAGE_SIZE * 3);
	mutex_unlock(&kvm->slots_lock);

	if (IS_ERR(hva))
		return PTR_ERR(hva);

	to_kvm_vmx(kvm)->tss_addr = addr;
	return init_rmode_tss(hva);
}

Yes, userspace can corrupt its VM by invoking KVM_SET_TSS_ADDR multiple
times without serializing the calls, but that's already true today.
