All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ashish Kalra <ashish.kalra@amd.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: "Singh, Brijesh" <brijesh.singh@amd.com>,
	Steve Rutherford <srutherford@google.com>,
	Sean Christopherson <seanjc@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Joerg Roedel <joro@8bytes.org>, Borislav Petkov <bp@suse.de>,
	"Lendacky, Thomas" <Thomas.Lendacky@amd.com>,
	X86 ML <x86@kernel.org>, KVM list <kvm@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"dovmurik@linux.vnet.ibm.com" <dovmurik@linux.vnet.ibm.com>,
	"tobin@ibm.com" <tobin@ibm.com>,
	"jejb@linux.ibm.com" <jejb@linux.ibm.com>,
	"frankeh@us.ibm.com" <frankeh@us.ibm.com>
Subject: Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
Date: Wed, 6 Jan 2021 23:05:55 +0000	[thread overview]
Message-ID: <20210106230555.GA13999@ashkalra_ubuntu_server> (raw)
In-Reply-To: <20201218195641.GL2956@work-vm>

On Fri, Dec 18, 2020 at 07:56:41PM +0000, Dr. David Alan Gilbert wrote:
> * Kalra, Ashish (Ashish.Kalra@amd.com) wrote:
> > Hello Dave,
> > 
> > On Dec 18, 2020, at 1:40 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> > * Ashish Kalra (ashish.kalra@amd.com) wrote:
> > On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote:
> > Hello All,
> > 
> > On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote:
> > 
> > On 12/7/20 9:09 PM, Steve Rutherford wrote:
> > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
> > On Sun, Dec 06, 2020, Paolo Bonzini wrote:
> > On 03/12/20 01:34, Sean Christopherson wrote:
> > On Tue, Dec 01, 2020, Ashish Kalra wrote:
> > From: Brijesh Singh <brijesh.singh@amd.com>
> > 
> > KVM hypercall framework relies on alternative framework to patch the
> > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> > apply_alternative() is called then it defaults to VMCALL. The approach
> > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> > will be able to decode the instruction and do the right things. But
> > when SEV is active, guest memory is encrypted with guest key and
> > hypervisor will not be able to decode the instruction bytes.
> > 
> > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> > will be used by the SEV guest to notify encrypted pages to the hypervisor.
> > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> > and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> > think there are any existing KVM hypercalls that happen before alternatives are
> > patched, i.e. it'll be a nop for sane kernel builds.
> > 
> > I'm also skeptical that a KVM specific hypercall is the right approach for the
> > encryption behavior, but I'll take that up in the patches later in the series.
> > Do you think that it's the guest that should "donate" memory for the bitmap
> > instead?
> > No.  Two things I'd like to explore:
> > 
> >  1. Making the hypercall to announce/request private vs. shared common across
> >     hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
> >     I'm concerned that we'll end up with multiple hypercalls that do more or
> >     less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
> >     pipe dream, but I'd like to at least explore options before shoving in KVM-
> >     only hypercalls.
> > 
> > 
> >  2. Tracking shared memory via a list of ranges instead of a using bitmap to
> >     track all of guest memory.  For most use cases, the vast majority of guest
> >     memory will be private, most ranges will be 2mb+, and conversions between
> >     private and shared will be uncommon events, i.e. the overhead to walk and
> >     split/merge list entries is hopefully not a big concern.  I suspect a list
> >     would consume far less memory, hopefully without impacting performance.
> > For a fancier data structure, I'd suggest an interval tree. Linux
> > already has an rbtree-based interval tree implementation, which would
> > likely work, and would probably assuage any performance concerns.
> > 
> > Something like this would not be worth doing unless most of the shared
> > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
> > 60ish discontiguous shared regions. This is by no means a thorough
> > search, but it's suggestive. If this is typical, then the bitmap would
> > be far less efficient than most any interval-based data structure.
> > 
> > You'd have to allow userspace to upper bound the number of intervals
> > (similar to the maximum bitmap size), to prevent host OOMs due to
> > malicious guests. There's something nice about the guest donating
> > memory for this, since that would eliminate the OOM risk.
> > 
> > 
> > Tracking the list of ranges may not be bad idea, especially if we use
> > the some kind of rbtree-based data structure to update the ranges. It
> > will certainly be better than bitmap which grows based on the guest
> > memory size and as you guys see in the practice most of the pages will
> > be guest private. I am not sure if guest donating a memory will cover
> > all the cases, e.g what if we do a memory hotplug (increase the guest
> > ram from 2GB to 64GB), will donated memory range will be enough to store
> > the metadata.
> > 
> > .
> > 
> > With reference to internal discussions regarding the above, i am going
> > to look into specific items as listed below :
> > 
> > 1). "hypercall" related :
> > a). Explore the SEV-SNP page change request structure (included in GHCB),
> > see if there is something common there than can be re-used for SEV/SEV-ES
> > page encryption status hypercalls.
> > b). Explore if there is any common hypercall framework i can use in
> > Linux/KVM.
> > 
> > 2). related to the "backing" data structure - explore using a range-based
> > list or something like rbtree-based interval tree data structure
> > (as mentioned by Steve above) to replace the current bitmap based
> > implementation.
> > 
> > 
> > 
> > I do agree that a range-based list or an interval tree data structure is a
> > really good "logical" fit for the guest page encryption status tracking.
> > 
> > We can only keep track of the guest unencrypted shared pages in the
> > range(s) list (which will keep the data structure quite compact) and all
> > the guest private/encrypted memory does not really need any tracking in
> > the list, anything not in the list will be encrypted/private.
> > 
> > Also looking at a more "practical" use case, here is the current log of
> > page encryption status hypercalls when booting a linux guest :
> > 
> > ...
> > 
> > <snip>
> > 
> > [   56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 1
> > [   56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> > [   56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> > [   56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 0
> > ....
> > 
> > [   56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 0
> > [   56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 0
> > [   56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 1
> > [   56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 1
> > 
> > ....
> > [   56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 0
> > [   56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 0
> > [   56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 1
> > [   56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 1
> > ....
> > 
> > [   56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 0
> > [   56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 0
> > [   56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 1
> > [   56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 1
> > ....
> > 
> > [   56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 0
> > [   56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 0
> > [   56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 1
> > [   56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 1
> > ....
> > [   56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> > [   56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> > [   56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 1
> > [   56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 0
> > [   56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> > [   56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> > ....
> > 
> > As can be observed here, all guest MMIO ranges are initially setup as
> > shared, and those are all contigious guest page ranges.
> > 
> > After that the encryption status hypercalls are invoked when DMA gets
> > triggered during disk i/o while booting the guest ... here again the
> > guest page ranges are contigious, though mostly single page is touched
> > and a lot of page re-use is observed.
> > 
> > So a range-based list/structure will be a "good" fit for such usage
> > scenarios.
> > 
> > It seems surprisingly common to flick the same pages back and forth between
> > encrypted and clear for quite a while;  why is this?
> > 
> > 
> > dma_alloc_coherent()'s will allocate pages and then call
> > set_decrypted() on them and then at dma_free_coherent(), set_encrypted()
> > is called on the pages to be freed. So these observations in the logs
> > where a lot of single 4K pages are seeing C-bit transitions and
> > corresponding hypercalls are the ones associated with
> > dma_alloc_coherent().
> 
> It makes me wonder if it might be worth teaching it to hold onto those
> DMA pages somewhere until it needs them for something else and avoid the
> extra hypercalls; just something to think about.
> 
> Dave

Following up on this discussion and looking at the hypercall logs and DMA usage scenarios on the SEV, I have the following additional observations and comments :

It is mostly the Guest MMIO regions setup as un-encrypted by uefi/edk2 initially, which will be the "static" nodes in the backing data structure for page encryption status. 
These will be like 15-20 nodes/entries.

Drivers doing DMA allocations using GFP_ATOMIC will be fetching DMA buffers from the pre-allocated unencrypted atomic pool, hence it will be a "static" node added at kernel startup.

As we see with the logs, almost all runtime C-bit transitions and corresponding hypercalls will be from DMA I/O and dma_alloc_coherent/dma_free_coherent calls, these will be 
using 4K/single pages and mostly fragmented ranges, so if we use a "rbtree" based interval tree then there will be a lot of tree insertions and deletions 
(dma_alloc_coherent followed with a dma_free_coherent), so this will lead to a lot of expensive tree rotations and re-balancing, compared to much less complex 
and faster linked list node insertions and deletions (if we use a list based structure to represent these interval ranges).

Also as the static nodes in the structure will be quite limited (all the above DMA I/O added ranges will simply be inserted and removed), so a linked list lookup 
won't be too expensive compared to a tree lookup. In other words, this be a fixed size list.

Looking at the above, I am now more inclined to use a list based structure to represent the page encryption status.

Looking fwd. to any comments/feedback/thoughts on the above.

Thanks,
Ashish


  reply	other threads:[~2021-01-06 23:07 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-01  0:45 [PATCH v2 0/9] Add AMD SEV page encryption bitmap support Ashish Kalra
2020-12-01  0:45 ` [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3 Ashish Kalra
2020-12-03  0:34   ` Sean Christopherson
2020-12-04 17:16     ` Brijesh Singh
2020-12-06 10:26     ` Paolo Bonzini
2020-12-07 20:41       ` Sean Christopherson
2020-12-08  3:09         ` Steve Rutherford
2020-12-08  4:16           ` Kalra, Ashish
2020-12-08 16:29           ` Brijesh Singh
2020-12-11 22:55             ` Ashish Kalra
2020-12-12  4:56               ` Ashish Kalra
2020-12-18 19:39                 ` Dr. David Alan Gilbert
     [not found]                   ` <E79E09A2-F314-4B59-B7AE-07B1D422DF2B@amd.com>
2020-12-18 19:56                     ` Dr. David Alan Gilbert
2021-01-06 23:05                       ` Ashish Kalra [this message]
2021-01-07  1:01                         ` Steve Rutherford
2021-01-07  1:34                           ` Ashish Kalra
2021-01-07  8:05                             ` Ashish Kalra
2021-01-08  0:47                               ` Ashish Kalra
2021-01-08  0:55                                 ` Steve Rutherford
2021-01-07 17:07                           ` Ashish Kalra
2021-01-07 17:26                             ` Sean Christopherson
2021-01-07 18:41                               ` Ashish Kalra
2021-01-07 19:22                                 ` Sean Christopherson
2021-01-08  0:54                                   ` Steve Rutherford
2021-01-08 16:56                                     ` Sean Christopherson
2020-12-01  0:46 ` [PATCH v2 2/9] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall Ashish Kalra
2020-12-02 16:54   ` Dr. David Alan Gilbert
2020-12-02 21:22     ` Ashish Kalra
2020-12-06 10:25       ` Paolo Bonzini
2020-12-01  0:47 ` [PATCH v2 3/9] KVM: x86: Introduce KVM_GET_PAGE_ENC_BITMAP ioctl Ashish Kalra
2020-12-06 11:02   ` Dov Murik
2020-12-07 22:00     ` Ashish Kalra
2020-12-01  0:47 ` [PATCH v2 4/9] mm: x86: Invoke hypercall when page encryption status is changed Ashish Kalra
2020-12-01  0:47 ` [PATCH v2 5/9] KVM: x86: Introduce KVM_SET_PAGE_ENC_BITMAP ioctl Ashish Kalra
2020-12-01  0:47 ` [PATCH v2 6/9] KVM: SVM: Add support for static allocation of unified Page Encryption Bitmap Ashish Kalra
2020-12-01  0:48 ` [PATCH v2 7/9] KVM: x86: Mark _bss_decrypted section variables as decrypted in page encryption bitmap Ashish Kalra
2020-12-01  0:48 ` [PATCH v2 8/9] KVM: x86: Add kexec support for SEV " Ashish Kalra
2020-12-01  0:48 ` [PATCH v2 9/9] KVM: SVM: Bypass DBG_DECRYPT API calls for unecrypted guest memory Ashish Kalra
2020-12-08  5:18 [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3 Kalra, Ashish

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210106230555.GA13999@ashkalra_ubuntu_server \
    --to=ashish.kalra@amd.com \
    --cc=Thomas.Lendacky@amd.com \
    --cc=bp@suse.de \
    --cc=brijesh.singh@amd.com \
    --cc=dgilbert@redhat.com \
    --cc=dovmurik@linux.vnet.ibm.com \
    --cc=frankeh@us.ibm.com \
    --cc=hpa@zytor.com \
    --cc=jejb@linux.ibm.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=seanjc@google.com \
    --cc=srutherford@google.com \
    --cc=tglx@linutronix.de \
    --cc=tobin@ibm.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.