kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Christophe de Dinechin <dinechin@redhat.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Yan Zhao <yan.y.zhao@intel.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	Kevin Kevin <kevin.tian@intel.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
Date: Tue, 28 Jan 2020 13:50:05 +0800	[thread overview]
Message-ID: <20200128055005.GB662081@xz-x1> (raw)
In-Reply-To: <20200121155657.GA7923@linux.intel.com>

On Tue, Jan 21, 2020 at 07:56:57AM -0800, Sean Christopherson wrote:
> On Thu, Jan 09, 2020 at 09:57:17AM -0500, Peter Xu wrote:
> > Originally, we have three code paths that can dirty a page without
> > vcpu context for X86:
> > 
> >   - init_rmode_identity_map
> >   - init_rmode_tss
> >   - kvmgt_rw_gpa
> > 
> > init_rmode_identity_map and init_rmode_tss will be setup on
> > destination VM no matter what (and the guest cannot even see them), so
> > it does not make sense to track them at all.
> > 
> > To do this, allow __x86_set_memory_region() to return the userspace
> > address that just allocated to the caller.  Then in both of the
> > functions we directly write to the userspace address instead of
> > calling kvm_write_*() APIs.  We need to make sure that we have the
> > slots_lock held when accessing the userspace address.
> > 
> > Another trivial change is that we don't need to explicitly clear the
> > identity page table root in init_rmode_identity_map() because no
> > matter what we'll write to the whole page with 4M huge page entries.
> > 
> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  3 +-
> >  arch/x86/kvm/svm.c              |  3 +-
> >  arch/x86/kvm/vmx/vmx.c          | 68 ++++++++++++++++-----------------
> >  arch/x86/kvm/x86.c              | 18 +++++++--
> >  4 files changed, 51 insertions(+), 41 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index eb6673c7d2e3..f536d139b3d2 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1618,7 +1618,8 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);
> >  
> >  int kvm_is_in_guest(void);
> >  
> > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size);
> > +int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size,
> > +			    unsigned long *uaddr);
> 
> No need for a new param, just return a "void __user *" (or "void *" if the
> __user part requires lots of casting) and use ERR_PTR() to encode errors in
> the return value.  I.e. return the userspace address.
> 
> The refactoring to return the address should be done in a separate patch as
> prep work for the move to __copy_to_user().

Yes this sounds cleaner, will do.

> 
> >  bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
> >  bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu);
> >  
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index 8f1b715dfde8..03a344ce7b66 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -1698,7 +1698,8 @@ static int avic_init_access_page(struct kvm_vcpu *vcpu)
> >  	ret = __x86_set_memory_region(kvm,
> >  				      APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
> >  				      APIC_DEFAULT_PHYS_BASE,
> > -				      PAGE_SIZE);
> > +				      PAGE_SIZE,
> > +				      NULL);
> >  	if (ret)
> >  		goto out;
> >  
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 7e3d370209e0..62175a246bcc 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -3441,34 +3441,28 @@ static bool guest_state_valid(struct kvm_vcpu *vcpu)
> >  	return true;
> >  }
> >  
> > -static int init_rmode_tss(struct kvm *kvm)
> > +static int init_rmode_tss(struct kvm *kvm, unsigned long *uaddr)
> 
> uaddr is not a pointer to an unsigned long, it's a pointer to a TSS.  Given
> that it's dereferenced as a "void __user *", it's probably best passed as
> exactly that.
> 
> This code also needs to be tested by doing unrestricted_guest=0 when
> loading kvm_intel, because it's obviously broken.  __x86_set_memory_region()
> takes an "unsigned long *", interpreted as a "pointer to a usersepace
> address", i.e. a "void __user **".  But the callers are treating the param
> as a "unsigned long in userpace", e.g. init_rmode_identity_map() declares
> uaddr as an "unsigned long *", when really it should be declaring a
> straight "unsigned long" and passing "&uaddr".  The only thing that saves
> KVM from dereferencing a bad pointer in __x86_set_memory_region() is that
> uaddr is initialized to NULL 

Yes it's broken.  Thanks very much for figuring it out.  I'll test
unrestricted_guest=N.

> 
> >  {
> > -	gfn_t fn;
> > +	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> >  	u16 data = 0;
> >  	int idx, r;
> >  
> > -	idx = srcu_read_lock(&kvm->srcu);
> > -	fn = to_kvm_vmx(kvm)->tss_addr >> PAGE_SHIFT;
> > -	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
> > -	if (r < 0)
> > -		goto out;
> > +	for (idx = 0; idx < 3; idx++) {
> > +		r = __copy_to_user((void __user *)uaddr + PAGE_SIZE * idx,
> > +				   zero_page, PAGE_SIZE);
> > +		if (r)
> > +			return -EFAULT;
> > +	}
> > +
> >  	data = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE;
> > -	r = kvm_write_guest_page(kvm, fn++, &data,
> > -			TSS_IOPB_BASE_OFFSET, sizeof(u16));
> > -	if (r < 0)
> > -		goto out;
> > -	r = kvm_clear_guest_page(kvm, fn++, 0, PAGE_SIZE);
> > -	if (r < 0)
> > -		goto out;
> > -	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
> > -	if (r < 0)
> > -		goto out;
> > +	r = __copy_to_user((void __user *)uaddr + TSS_IOPB_BASE_OFFSET,
> > +			   &data, sizeof(data));
> > +	if (r)
> > +		return -EFAULT;
> > +
> >  	data = ~0;
> > -	r = kvm_write_guest_page(kvm, fn, &data,
> > -				 RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1,
> > -				 sizeof(u8));
> > -out:
> > -	srcu_read_unlock(&kvm->srcu, idx);
> > +	r = __copy_to_user((void __user *)uaddr - 1, &data, sizeof(data));
> > +
> >  	return r;
> 
> Why not "return __copy_to_user();"?

Sure.

> 
> >  }
> >  
> > @@ -3478,6 +3472,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
> >  	int i, r = 0;
> >  	kvm_pfn_t identity_map_pfn;
> >  	u32 tmp;
> > +	unsigned long *uaddr = NULL;
> 
> Again, not a pointer to an unsigned long.
> 
> >  	/* Protect kvm_vmx->ept_identity_pagetable_done. */
> >  	mutex_lock(&kvm->slots_lock);
> > @@ -3490,21 +3485,21 @@ static int init_rmode_identity_map(struct kvm *kvm)
> >  	identity_map_pfn = kvm_vmx->ept_identity_map_addr >> PAGE_SHIFT;
> >  
> >  	r = __x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT,
> > -				    kvm_vmx->ept_identity_map_addr, PAGE_SIZE);
> > +				    kvm_vmx->ept_identity_map_addr, PAGE_SIZE,
> > +				    uaddr);
> >  	if (r < 0)
> >  		goto out;
> >  
> > -	r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
> > -	if (r < 0)
> > -		goto out;
> >  	/* Set up identity-mapping pagetable for EPT in real mode */
> >  	for (i = 0; i < PT32_ENT_PER_PAGE; i++) {
> >  		tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
> >  			_PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
> > -		r = kvm_write_guest_page(kvm, identity_map_pfn,
> > -				&tmp, i * sizeof(tmp), sizeof(tmp));
> > -		if (r < 0)
> > +		r = __copy_to_user((void __user *)uaddr + i * sizeof(tmp),
> > +				   &tmp, sizeof(tmp));
> > +		if (r) {
> > +			r = -EFAULT;
> >  			goto out;
> > +		}
> >  	}
> >  	kvm_vmx->ept_identity_pagetable_done = true;
> >  
> > @@ -3537,7 +3532,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
> >  	if (kvm->arch.apic_access_page_done)
> >  		goto out;
> >  	r = __x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
> > -				    APIC_DEFAULT_PHYS_BASE, PAGE_SIZE);
> > +				    APIC_DEFAULT_PHYS_BASE, PAGE_SIZE, NULL);
> >  	if (r)
> >  		goto out;
> >  
> > @@ -4478,19 +4473,22 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
> >  static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
> >  {
> >  	int ret;
> > +	unsigned long *uaddr = NULL;
> >  
> >  	if (enable_unrestricted_guest)
> >  		return 0;
> >  
> >  	mutex_lock(&kvm->slots_lock);
> >  	ret = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
> > -				      PAGE_SIZE * 3);
> > -	mutex_unlock(&kvm->slots_lock);
> > -
> > +				      PAGE_SIZE * 3, uaddr);
> >  	if (ret)
> > -		return ret;
> > +		goto out;
> > +
> >  	to_kvm_vmx(kvm)->tss_addr = addr;
> > -	return init_rmode_tss(kvm);
> > +	ret = init_rmode_tss(kvm, uaddr);
> > +out:
> > +	mutex_unlock(&kvm->slots_lock);
> 
> Unnecessary, see below.

Do you mean that we don't even need the lock?

I feel like this could at least fail lockdep.  More below.

[1]

> 
> > +	return ret;
> >  }
> >  
> >  static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index c4d3972dcd14..ff97782b3919 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9584,7 +9584,15 @@ void kvm_arch_sync_events(struct kvm *kvm)
> >  	kvm_free_pit(kvm);
> >  }
> >  
> > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > +/*
> > + * If `uaddr' is specified, `*uaddr' will be returned with the
> > + * userspace address that was just allocated.  `uaddr' is only
> > + * meaningful if the function returns zero, and `uaddr' will only be
> > + * valid when with either the slots_lock or with the SRCU read lock
> > + * held.  After we release the lock, the returned `uaddr' will be invalid.
> 
> This is all incorrect.  Neither of those locks has any bearing on the
> validity of the hva.  slots_lock does as the name suggests and prevents
> concurrent writes to the memslots.  The SRCU lock ensures the implicit
> memslots lookup in kvm_clear_guest_page() won't result in a use-after-free
> due to derefencing old memslots.
> 
> Neither of those has anything to do with the userspace address, they're
> both fully tied to KVM's gfn->hva lookup.  As Paolo pointed out, KVM's
> mapping is instead tied to the lifecycle of the VM.  Note, even *that* has
> no bearing on the validity of the mapping or address as KVM only increments
> mm_count, not mm_users, i.e. guarantees the mm struct itself won't be freed
> but doesn't ensure the vmas or associated pages tables are valid.
> 
> Which is the entire point of using __copy_{to,from}_user(), as they
> gracefully handle the scenario where the process has not valid mapping
> and/or translation for the address.

Sorry I don't understand.

I do think either the slots_lock or SRCU would protect at least the
existing kvm.memslots, and if so at least the previous vm_mmap()
return value should still be valid.  I agree that __copy_to_user()
will protect us from many cases from process mm pov (which allows page
faults inside), but again if the kvm.memslots is changed underneath us
then it's another story, IMHO, and that's why we need either the lock
or SRCU.

Or are you assuming that (1) __x86_set_memory_region() is only for the
3 private kvm memslots, and (2) currently the kvm private memory slots
will never change after VM is created and before VM is destroyed?  If
so, I agree with you.  However I don't see why we need to restrict
__x86_set_memory_region() with that assumption, after all taking a
lock is not expensive in this slow path.  Even if so, we'd better
comment above __x86_set_memory_region() about this, so we know that we
should not use __x86_set_memory_region() for future kvm internal
memslots that are prone to change during VM's lifecycle (while
currently it seems to be a very general interface).

Please let me know if I misunderstood your point.

Thanks,

-- 
Peter Xu


  parent reply	other threads:[~2020-01-28  5:50 UTC|newest]

Thread overview: 82+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
2020-01-09 14:57 ` [PATCH v3 01/21] vfio: introduce vfio_iova_rw to read/write a range of IOVAs Peter Xu
2020-01-09 14:57 ` [PATCH v3 02/21] drm/i915/gvt: subsitute kvm_read/write_guest with vfio_iova_rw Peter Xu
2020-01-09 14:57 ` [PATCH v3 03/21] KVM: Remove kvm_read_guest_atomic() Peter Xu
2020-01-09 14:57 ` [PATCH v3 04/21] KVM: Add build-time error check on kvm_run size Peter Xu
2020-01-09 14:57 ` [PATCH v3 05/21] KVM: X86: Change parameter for fast_page_fault tracepoint Peter Xu
2020-01-09 14:57 ` [PATCH v3 06/21] KVM: X86: Don't take srcu lock in init_rmode_identity_map() Peter Xu
2020-01-09 14:57 ` [PATCH v3 07/21] KVM: Cache as_id in kvm_memory_slot Peter Xu
2020-01-09 14:57 ` [PATCH v3 08/21] KVM: X86: Drop x86_set_memory_region() Peter Xu
2020-01-09 14:57 ` [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] Peter Xu
2020-01-19  9:01   ` Paolo Bonzini
2020-01-20  6:45     ` Peter Xu
2020-01-21 15:56   ` Sean Christopherson
2020-01-21 16:14     ` Paolo Bonzini
2020-01-28  5:50     ` Peter Xu [this message]
2020-01-28 18:24       ` Sean Christopherson
2020-01-31 15:08         ` Peter Xu
2020-01-31 19:33           ` Sean Christopherson
2020-01-31 20:28             ` Peter Xu
2020-01-31 20:36               ` Sean Christopherson
2020-01-31 20:55                 ` Peter Xu
2020-01-31 21:29                   ` Sean Christopherson
2020-01-31 22:16                     ` Peter Xu
2020-01-31 22:20                       ` Sean Christopherson
2020-01-09 14:57 ` [PATCH v3 10/21] KVM: Pass in kvm pointer into mark_page_dirty_in_slot() Peter Xu
2020-01-09 14:57 ` [PATCH v3 11/21] KVM: Move running VCPU from ARM to common code Peter Xu
2020-01-09 14:57 ` [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
2020-01-09 16:29   ` Michael S. Tsirkin
2020-01-09 16:56     ` Alex Williamson
2020-01-09 19:21       ` Peter Xu
2020-01-09 19:36         ` Michael S. Tsirkin
2020-01-09 19:15     ` Peter Xu
2020-01-09 19:35       ` Michael S. Tsirkin
2020-01-09 20:19         ` Peter Xu
2020-01-09 22:18           ` Michael S. Tsirkin
2020-01-10 15:29             ` Peter Xu
2020-01-12  6:24               ` Michael S. Tsirkin
2020-01-14 20:01         ` Peter Xu
2020-01-15  6:50           ` Michael S. Tsirkin
2020-01-15 15:20             ` Peter Xu
2020-01-19  9:09       ` Paolo Bonzini
2020-01-19 10:12         ` Michael S. Tsirkin
2020-01-20  7:29           ` Peter Xu
2020-01-20  7:47             ` Michael S. Tsirkin
2020-01-21  8:29               ` Peter Xu
2020-01-21 10:25                 ` Paolo Bonzini
2020-01-21 10:24             ` Paolo Bonzini
2020-01-11  4:49   ` kbuild test robot
2020-01-11 23:19   ` kbuild test robot
2020-01-15  6:47   ` Michael S. Tsirkin
2020-01-15 15:27     ` Peter Xu
2020-01-16  8:38   ` Michael S. Tsirkin
2020-01-16 16:27     ` Peter Xu
2020-01-17  9:50       ` Michael S. Tsirkin
2020-01-20  6:48         ` Peter Xu
2020-01-09 14:57 ` [PATCH v3 13/21] KVM: Make dirty ring exclusive to dirty bitmap log Peter Xu
2020-01-09 14:57 ` [PATCH v3 14/21] KVM: Don't allocate dirty bitmap if dirty ring is enabled Peter Xu
2020-01-09 16:41   ` Peter Xu
2020-01-09 14:57 ` [PATCH v3 15/21] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
2020-01-09 14:57 ` [PATCH v3 16/21] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
2020-01-09 14:57 ` [PATCH v3 17/21] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
2020-01-09 14:57 ` [PATCH v3 18/21] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
2020-01-09 14:57 ` [PATCH v3 19/21] KVM: selftests: Add dirty ring buffer test Peter Xu
2020-01-09 14:57 ` [PATCH v3 20/21] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
2020-01-09 14:57 ` [PATCH v3 21/21] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
2020-01-09 15:59 ` [PATCH v3 00/21] KVM: Dirty ring interface Michael S. Tsirkin
2020-01-09 16:17   ` Peter Xu
2020-01-09 16:40     ` Michael S. Tsirkin
2020-01-09 17:08       ` Peter Xu
2020-01-09 19:08         ` Michael S. Tsirkin
2020-01-09 19:39           ` Peter Xu
2020-01-09 20:42             ` Paolo Bonzini
2020-01-09 22:28             ` Michael S. Tsirkin
2020-01-10 15:10               ` Peter Xu
2020-01-09 16:47 ` Alex Williamson
2020-01-09 17:58   ` Peter Xu
2020-01-09 19:13     ` Michael S. Tsirkin
2020-01-09 19:23       ` Peter Xu
2020-01-09 19:37         ` Michael S. Tsirkin
2020-01-09 20:51       ` Paolo Bonzini
2020-01-09 22:21         ` Michael S. Tsirkin
2020-01-19  9:11 ` Paolo Bonzini

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200128055005.GB662081@xz-x1 \
    --to=peterx@redhat.com \
    --cc=alex.williamson@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=dinechin@redhat.com \
    --cc=jasowang@redhat.com \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=sean.j.christopherson@intel.com \
    --cc=vkuznets@redhat.com \
    --cc=yan.y.zhao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).