From: Peter Xu <peterx@redhat.com>
To: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Christophe de Dinechin <dinechin@redhat.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Yan Zhao <yan.y.zhao@intel.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	Kevin Tian <kevin.tian@intel.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
Date: Fri, 31 Jan 2020 15:28:24 -0500
Message-ID: <20200131202824.GA7063@xz-x1>
In-Reply-To: <20200131193301.GC18946@linux.intel.com>

On Fri, Jan 31, 2020 at 11:33:01AM -0800, Sean Christopherson wrote:
> On Fri, Jan 31, 2020 at 10:08:32AM -0500, Peter Xu wrote:
> > On Tue, Jan 28, 2020 at 10:24:03AM -0800, Sean Christopherson wrote:
> > > On Tue, Jan 28, 2020 at 01:50:05PM +0800, Peter Xu wrote:
> > > > On Tue, Jan 21, 2020 at 07:56:57AM -0800, Sean Christopherson wrote:
> > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > index c4d3972dcd14..ff97782b3919 100644
> > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > @@ -9584,7 +9584,15 @@ void kvm_arch_sync_events(struct kvm *kvm)
> > > > > >  	kvm_free_pit(kvm);
> > > > > >  }
> > > > > >  
> > > > > > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > > > > > +/*
> > > > > > + * If `uaddr' is specified, `*uaddr' will be returned with the
> > > > > > + * userspace address that was just allocated.  `uaddr' is only
> > > > > > + * meaningful if the function returns zero, and `uaddr' will only be
> > > > > > + * valid when with either the slots_lock or with the SRCU read lock
> > > > > > + * held.  After we release the lock, the returned `uaddr' will be invalid.
> > > > > 
> > > > > This is all incorrect.  Neither of those locks has any bearing on the
> > > > > validity of the hva.  slots_lock does as the name suggests and prevents
> > > > > concurrent writes to the memslots.  The SRCU lock ensures the implicit
> > > > > memslots lookup in kvm_clear_guest_page() won't result in a use-after-free
> > > > > due to dereferencing old memslots.
> > > > > 
> > > > > Neither of those has anything to do with the userspace address, they're
> > > > > both fully tied to KVM's gfn->hva lookup.  As Paolo pointed out, KVM's
> > > > > mapping is instead tied to the lifecycle of the VM.  Note, even *that* has
> > > > > no bearing on the validity of the mapping or address as KVM only increments
> > > > > mm_count, not mm_users, i.e. guarantees the mm struct itself won't be freed
> > > > > but doesn't ensure the vmas or associated page tables are valid.
> > > > > 
> > > > > Which is the entire point of using __copy_{to,from}_user(), as they
> > > > > gracefully handle the scenario where the process has no valid mapping
> > > > > and/or translation for the address.
> > > > 
> > > > Sorry I don't understand.
> > > > 
> > > > I do think either the slots_lock or SRCU would protect at least the
> > > > existing kvm.memslots, and if so at least the previous vm_mmap()
> > > > return value should still be valid.
> > > 
> > > Nope.  kvm->slots_lock only protects gfn->hva lookups, e.g. userspace can
> > > munmap() the range at any time.
> > 
> > Do we need to consider that?  If userspace does this then it'll
> > corrupt itself, and IMHO a private memory slot is nothing special
> > here compared to the user memory slots.  For example, userspace
> > can unmap any region after the KVM_SET_USER_MEMORY_REGION ioctl even if
> > the region was filled into the userspace_addr of a
> > kvm_userspace_memory_region, so the cached userspace_addr can become
> > invalid and kvm_write_guest_page() can fail for the same
> > reason.  IMHO KVM only needs to make sure it handles the failure path,
> > then it's perfectly fine.
> 
> Yes?  No?  My point is that your original comment's assertion that "'uaddr'
> will only be valid when with either the slots_lock or with the SRCU read
> lock held." is wrong and misleading.

Yes I'll fix that.

> 
> > > > I agree that __copy_to_user() will protect us in many cases from the process
> > > > mm point of view (it allows page faults inside), but if kvm.memslots is
> > > > changed underneath us then it's another story, IMHO, and that's why we need
> > > > either the lock or SRCU.
> > > 
> > > No, again, slots_lock and SRCU only protect gfn->hva lookups.
> > 
> > Yes, then could you further explain why you think we don't need the
> > slots_lock?
> 
> For the same reason we don't take mmap_sem, it gains us nothing, i.e. KVM
> still has to use copy_{to,from}_user().
> 
> In the proposed __x86_set_memory_region() refactor, vmx_set_tss_addr()
> would be provided the hva of the memory region.  Since slots_lock and SRCU
> only protect gfn->hva, why would KVM take slots_lock since it already has
> the hva?

OK, so you're suggesting releasing the lock earlier so that it doesn't cover
init_rmode_tss(), rather than dropping the lock entirely...  Yes, that looks
good to me.  I think that was the major source of my confusion.
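
To make sure I follow, a minimal sketch of the narrowed lock scope as I
understand it (illustrative only; it uses the uaddr out-parameter from this
series, and declarations and some error paths are trimmed):

	mutex_lock(&kvm->slots_lock);
	/* Only the memslot update itself needs to be mutually exclusive. */
	r = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
				    PAGE_SIZE * 3, &uaddr);
	mutex_unlock(&kvm->slots_lock);
	if (r)
		return r;

	/* The TSS pages are then written via the hva, outside slots_lock. */
	return init_rmode_tss((void __user *)uaddr);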

> 
> > > > Or are you assuming that (1) __x86_set_memory_region() is only for the
> > > > 3 private kvm memslots, 
> > > 
> > > It's not an assumption, the entire purpose of __x86_set_memory_region()
> > > is to provide support for private KVM memslots.
> > > 
> > > > and (2) currently the kvm private memory slots will never change after VM
> > > > is created and before VM is destroyed?
> > > 
> > > No, I'm not assuming the private memslots are constant, e.g. the flow in
> > > question, vmx_set_tss_addr() is directly tied to an unprotected ioctl().
> > 
> > Why is it unprotected?
> 
> Because it doesn't need to be protected.
> 
> > Now vmx_set_tss_addr() is protected by the slots_lock so concurrent operation
> > is safe, and it'll return -EEXIST if called more than once.
> 
> Returning -EEXIST is an ABI change, e.g. userspace can currently call
> KVM_SET_TSS_ADDR any number of times, it just needs to ensure proper
> serialization between calls.
> 
> If you want to change the ABI, then submit a patch to do exactly that.
> But don't bury an ABI change under the pretense that it's a bug fix.

Could you explain what you mean by "ABI change"?

I was talking about the original code, not after applying the
patchset.  To be explicit, I mean [a] below:

int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size,
			    unsigned long *uaddr)
{
	int i, r;
	unsigned long hva;
	struct kvm_memslots *slots = kvm_memslots(kvm);
	struct kvm_memory_slot *slot, old;

	/* Called with kvm->slots_lock held.  */
	if (WARN_ON(id >= KVM_MEM_SLOTS_NUM))
		return -EINVAL;

	slot = id_to_memslot(slots, id);
	if (size) {
		if (slot->npages)
			return -EEXIST;  <------------------------ [a]
	}
	...
}

> 
> > [1]
> > 
> > > 
> > > KVM's sole responsibility for vmx_set_tss_addr() is to not crash the kernel.
> > > Userspace is responsible for ensuring it doesn't break its guests, e.g.
> > > that multiple calls to KVM_SET_TSS_ADDR are properly serialized.
> > > 
> > > In the existing code, KVM ensures it doesn't crash by holding the SRCU lock
> > > for the duration of init_rmode_tss() so that the gfn->hva lookups in
> > > kvm_clear_guest_page() don't dereference a stale memslots array.
> > 
> > Here in the current master branch we have both the RCU lock and the
> > slot lock held, that's why I think we can safely remove the RCU lock
> > as long as we're still holding the slots lock.  We can't do the
> > reverse because otherwise multiple KVM_SET_TSS_ADDR could race.
> 
> Your wording is all messed up.  "we have both the RCU lock and the slot
> lock held" is wrong.

I did mess up with 2a5755bb21ee2.  We didn't take both locks there,
sorry.

> KVM holds slots_lock around __x86_set_memory_region(),
> because changing the memslots must be mutually exclusive.  It then *drops*
> slots_lock because it's done writing the memslots and grabs the SRCU lock
> in order to protect the gfn->hva lookups done by init_rmode_tss().  It
> *intentionally* drops slots_lock because the writes done by init_rmode_tss()
> do not need to be mutually exclusive, per KVM's existing ABI.
> 
> If KVM held both slots_lock and SRCU then __x86_set_memory_region() would
> deadlock on synchronize_srcu().
> 
> > > In no way
> > > does that ensure the validity of the resulting hva,
> > 
> > Yes, but as I mentioned, I don't think it's an issue to be considered
> > by KVM, otherwise we'd have the same issue all over the place
> > whenever we fetch the cached userspace_addr from any user slot.
> 
> Huh?  Of course it's an issue that needs to be considered by KVM, e.g.
> kvm_{read,write}_guest_cached() aren't using __copy_{to,from}_user() for
> giggles.

The cache is for the GPA->HVA translation (struct gfn_to_hva_cache);
we still use __copy_{to,from}_user() on the HVAs, no?
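
For reference, that path looks roughly like below (paraphrasing
kvm_write_guest_offset_cached() from memory, so details may be off):

	struct kvm_memslots *slots = kvm_memslots(kvm);

	/* Re-resolve GPA->HVA if the memslots generation has changed. */
	if (slots->generation != ghc->generation)
		__kvm_gfn_to_hva_cache_init(slots, ghc, ghc->gpa, ghc->len);

	if (kvm_is_error_hva(ghc->hva))
		return -EFAULT;

	/* The write itself still goes through __copy_to_user() on the hva. */
	if (__copy_to_user((void __user *)ghc->hva + offset, data, len))
		return -EFAULT;

	mark_page_dirty_in_slot(ghc->memslot, (ghc->gpa + offset) >> PAGE_SHIFT);

So the memslots (and SRCU) only matter for refreshing the GPA->HVA
translation; the write to the hva is a plain __copy_to_user().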

> 
> > > e.g. multiple calls to
> > > KVM_SET_TSS_ADDR would race to set vmx->tss_addr and so init_rmode_tss()
> > > could be operating on a stale gpa.
> > 
> > Please refer to [1].
> > 
> > I just want to double-confirm what we're discussing now.  Are you
> > sure you're suggesting that we should remove the slots_lock from
> > init_rmode_tss()?  Asking because you discussed quite a bit about how the
> > slots_lock should protect GPA->HVA, about concurrency and so on, so
> > I'm even more confused...
> 
> Yes, if init_rmode_tss() is provided the hva then it does not need to
> grab srcu_read_lock(&kvm->srcu) because it can directly call
> __copy_{to,from}_user() instead of bouncing through the KVM helpers that
> translate a gfn to hva.
> 
> The code can look like this.  That being said, I've completely lost track
> of why __x86_set_memory_region() needs to provide the hva, i.e. I have no
> idea whether we *should* do this, or whether it would be better to keep the
> current code, which would be slower, but less custom.
> 
> static int init_rmode_tss(void __user *hva)
> {
> 	const void *zero_page = (const void *)__va(page_to_phys(ZERO_PAGE(0)));
> 	u16 data = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE;
> 	int r;
> 
> 	/* Clear the first TSS page and set the I/O bitmap base offset. */
> 	r = __copy_to_user(hva, zero_page, PAGE_SIZE);
> 	if (r)
> 		return -EFAULT;
> 
> 	r = __copy_to_user(hva + TSS_IOPB_BASE_OFFSET, &data, sizeof(u16));
> 	if (r)
> 		return -EFAULT;
> 
> 	/* Clear the second page. */
> 	hva += PAGE_SIZE;
> 	r = __copy_to_user(hva, zero_page, PAGE_SIZE);
> 	if (r)
> 		return -EFAULT;
> 
> 	/* Clear the third page. */
> 	hva += PAGE_SIZE;
> 	r = __copy_to_user(hva, zero_page, PAGE_SIZE);
> 	if (r)
> 		return -EFAULT;
> 
> 	/* Terminate the I/O bitmap with an all-ones byte at the end of the TSS. */
> 	data = ~0;
> 	hva += RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1;
> 	r = __copy_to_user(hva, &data, sizeof(u8));
> 	if (r)
> 		return -EFAULT;
> 
> 	return 0;
> }
> 
> static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
> {
> 	void __user *hva;
> 
> 	if (enable_unrestricted_guest)
> 		return 0;
> 
> 	mutex_lock(&kvm->slots_lock);
> 	hva = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
> 				      PAGE_SIZE * 3);
> 	mutex_unlock(&kvm->slots_lock);
> 
> 	if (IS_ERR(hva))
> 		return PTR_ERR(hva);
> 
> 	to_kvm_vmx(kvm)->tss_addr = addr;
> 	return init_rmode_tss(hva);
> }
> 
> Yes, userspace can corrupt its VM by invoking KVM_SET_TSS_ADDR multiple
> times without serializing the calls, but that's already true today.

But I still don't see why we have any problem here.  Only the first
thread will take the slots_lock and succeed with this ioctl.  The
remaining threads will fail with -EEXIST, no?

-- 
Peter Xu

