KVM Archive on lore.kernel.org
 help / color / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Sean Christopherson <seanjc@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Andy Lutomirski <luto@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Jim Mattson <jmattson@google.com>,
	David Rientjes <rientjes@google.com>,
	"Edgecombe, Rick P" <rick.p.edgecombe@intel.com>,
	"Kleen, Andi" <andi.kleen@intel.com>,
	"Yamahata, Isaku" <isaku.yamahata@intel.com>,
	Erdem Aktas <erdemaktas@google.com>,
	Steve Rutherford <srutherford@google.com>,
	Peter Gonda <pgonda@google.com>,
	David Hildenbrand <david@redhat.com>,
	x86@kernel.org, kvm@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
Date: Mon, 19 Apr 2021 17:26:02 +0300
Message-ID: <20210419142602.khjbzktk5tk5l6lk@box.shutemov.name> (raw)
In-Reply-To: <YHnJtvXdrZE+AfM3@google.com>

On Fri, Apr 16, 2021 at 05:30:30PM +0000, Sean Christopherson wrote:
> On Fri, Apr 16, 2021, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1b404e4d7dd8..f8183386abe7 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -8170,6 +8170,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >  		kvm_sched_yield(vcpu->kvm, a0);
> >  		ret = 0;
> >  		break;
> > +	case KVM_HC_ENABLE_MEM_PROTECTED:
> > +		ret = kvm_protect_memory(vcpu->kvm);
> > +		break;
> > +	case KVM_HC_MEM_SHARE:
> > +		ret = kvm_share_memory(vcpu->kvm, a0, a1);
> 
> Can you take a look at a proposed hypercall interface for SEV live migration and
> holler if you (or anyone else) thinks it will have extensibility issues?
> 
> https://lkml.kernel.org/r/93d7f2c2888315adc48905722574d89699edde33.1618498113.git.ashish.kalra@amd.com

Will look closer. Thanks.

> > @@ -1868,11 +1874,17 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> >  		flags |= FOLL_WRITE;
> >  	if (async)
> >  		flags |= FOLL_NOWAIT;
> > +	if (kvm->mem_protected)
> > +		flags |= FOLL_ALLOW_POISONED;
> 
> This is unsafe, only the flows that are mapping the PFN into the guest should
> use ALLOW_POISONED, e.g. __kvm_map_gfn() should fail on a poisoned page.

That's true for TDX. I prototyped with pure KVM with minimal modification
to the guest. We had to be more permissive for the reason. It will go
away for TDX.

> > -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > -				 void *data, int offset, int len)
> > +int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len)
> > +{
> > +	int offset = offset_in_page(hva);
> > +	struct page *page;
> > +	int npages, seg;
> > +	void *vaddr;
> > +
> > +	if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
> > +	    !kvm->mem_protected) {
> > +		return __copy_from_user(data, (void __user *)hva, len);
> > +	}
> > +
> > +	might_fault();
> > +	kasan_check_write(data, len);
> > +	check_object_size(data, len, false);
> > +
> > +	while ((seg = next_segment(len, offset)) != 0) {
> > +		npages = get_user_pages_unlocked(hva, 1, &page,
> > +						 FOLL_ALLOW_POISONED);
> > +		if (npages != 1)
> > +			return -EFAULT;
> > +
> > +		if (!kvm_page_allowed(kvm, page))
> > +			return -EFAULT;
> > +
> > +		vaddr = kmap_atomic(page);
> > +		memcpy(data, vaddr + offset, seg);
> > +		kunmap_atomic(vaddr);
> 
> Why is KVM allowed to access a poisoned page?  I would expect shared pages to
> _not_ be poisoned.  Except for pure software emulation of SEV, KVM can't access
> guest private memory.

Again, it's not going to be in TDX implementation.


> I like the idea of using "special" PTE value to denote guest private memory,
> e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM involved in the
> manipulation of the special flag/value.
> 
> Today, userspace owns the gfn->hva translations and the kernel effectively owns
> the hva->pfn translations (with input from userspace).  KVM just connects the
> dots.
> 
> Having KVM own the shared/private transitions means KVM is now part owner of the
> entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary MMU
> and a co-owner of the primary MMU.  This creates locking madness, e.g. KVM taking
> mmap_sem for write, mmu_lock under page lock, etc..., and also takes control away
> from userspace.  E.g. userspace strategy could be to use a separate backing/pool
> for shared memory and change the gfn->hva translation (memslots) in reaction to
> a shared/private conversion.  Automatically swizzling things in KVM takes away
> that option.
> 
> IMO, KVM should be entirely "passive" in this process, e.g. the guest shares or
> protects memory, userspace calls into the kernel to change state, and the kernel
> manages the page tables to prevent bad actors.  KVM simply does the plumbing for
> the guest page tables.

That's a new perspective for me. Very interesting.

Let's see how it can look like:

 - KVM only allows poisoned pages (or whatever flag we end up using for
   protection) in the private mappings. SIGBUS otherwise.

 - Poisoned pages must be tied to the KVM instance to be allowed in the
   private mappings. Like kvm->id in the current prototype. SIGBUS
   otherwise.

 - Pages get poisoned on fault in if the VMA has a new vmflag set.

 - Fault in of a poisoned page leads to hwpoison entry. Userspace cannot
   access such pages.

 - Poisoned pages produced this way get unpoisoned on free.

 - The new VMA flag set by userspace. mprotect(2)?

 - Add a new GUP flag to retrive such pages from the userspace mapping.
   Used only for private mapping population.

 - Shared gfn ranges managed by userspace, based on hypercalls from the
   guest.

 - Shared mappings get populated via normal VMA. Any poisoned pages here
   would lead to SIGBUS.

So far it looks pretty straight-forward.

The only thing that I don't understand is at way point the page gets tied
to the KVM instance. Currently we do it just before populating shadow
entries, but it would not work with the new scheme: as we poison pages
on fault it they may never get inserted into shadow entries. That's not
good as we rely on the info to unpoison page on free.

Maybe we should tie VMA to the KVM instance on setting the vmflags?
I donno.

Any comments?

-- 
 Kirill A. Shutemov

  parent reply index

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
2021-04-16 15:40 ` [RFCv2 01/13] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
2021-04-16 15:40 ` [RFCv2 02/13] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
2021-04-16 16:10   ` Borislav Petkov
2021-04-19 10:10     ` Kirill A. Shutemov
2021-04-16 15:40 ` [RFCv2 03/13] x86/kvm: Make DMA pages shared Kirill A. Shutemov
2021-04-16 15:40 ` [RFCv2 04/13] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
2021-04-16 16:21   ` Dave Hansen
2021-04-16 15:40 ` [RFCv2 05/13] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
2021-04-16 15:40 ` [RFCv2 06/13] x86/realmode: Share trampoline area if KVM memory protection enabled Kirill A. Shutemov
2021-04-19 16:49   ` Dave Hansen
2021-04-16 15:41 ` [RFCv2 07/13] mm: Add hwpoison_entry_to_pfn() and hwpoison_entry_to_page() Kirill A. Shutemov
2021-04-16 15:41 ` [RFCv2 08/13] mm/gup: Add FOLL_ALLOW_POISONED Kirill A. Shutemov
2021-04-16 15:41 ` [RFCv2 09/13] shmem: Fail shmem_getpage_gfp() on poisoned pages Kirill A. Shutemov
2021-04-16 15:41 ` [RFCv2 10/13] mm: Keep page reference for hwpoison entries Kirill A. Shutemov
2021-04-16 15:41 ` [RFCv2 11/13] mm: Replace hwpoison entry with present PTE if page got unpoisoned Kirill A. Shutemov
2021-04-16 15:41 ` [RFCv2 12/13] KVM: passdown struct kvm to hva_to_pfn_slow() Kirill A. Shutemov
2021-04-16 15:41 ` [RFCv2 13/13] KVM: unmap guest memory using poisoned pages Kirill A. Shutemov
2021-04-16 17:30   ` Sean Christopherson
2021-04-19 11:32     ` Xiaoyao Li
2021-04-19 14:26     ` Kirill A. Shutemov [this message]
2021-04-19 16:01       ` Sean Christopherson
2021-04-19 16:40         ` Kirill A. Shutemov
2021-04-19 18:09           ` Sean Christopherson
2021-04-19 18:12             ` David Hildenbrand
2021-04-19 18:53             ` Kirill A. Shutemov
2021-04-19 20:09               ` Sean Christopherson
2021-04-19 22:57                 ` Kirill A. Shutemov
2021-04-20 17:13                   ` Sean Christopherson
2021-05-21 12:31                     ` Kirill A. Shutemov
2021-05-26 19:46                       ` Sean Christopherson
2021-05-31 20:07                         ` Kirill A. Shutemov
2021-06-02 17:51                           ` Sean Christopherson
2021-06-02 23:33                             ` Kirill A. Shutemov
2021-06-03 19:46                               ` Sean Christopherson
2021-06-04 14:29                                 ` Kirill A. Shutemov
2021-06-04 17:16                       ` Andy Lutomirski
2021-06-04 17:54                         ` Kirill A. Shutemov
2021-04-16 16:46 ` [RFCv2 00/13] TDX and guest memory unmapping Matthew Wilcox

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210419142602.khjbzktk5tk5l6lk@box.shutemov.name \
    --to=kirill@shutemov.name \
    --cc=andi.kleen@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=erdemaktas@google.com \
    --cc=isaku.yamahata@intel.com \
    --cc=jmattson@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=peterz@infradead.org \
    --cc=pgonda@google.com \
    --cc=rick.p.edgecombe@intel.com \
    --cc=rientjes@google.com \
    --cc=seanjc@google.com \
    --cc=srutherford@google.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

KVM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/kvm/0 kvm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 kvm kvm/ https://lore.kernel.org/kvm \
		kvm@vger.kernel.org
	public-inbox-index kvm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.kvm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git