kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org, Chao Peng <chao.p.peng@linux.intel.com>,
	Fuad Tabba <tabba@google.com>,
	Yu Zhang <yu.c.zhang@linux.intel.com>,
	Vishal Annapurve <vannapurve@google.com>,
	Ackerley Tng <ackerleytng@google.com>,
	Maciej Szmigiero <mail@maciej.szmigiero.name>,
	Vlastimil Babka <vbabka@suse.cz>,
	Quentin Perret <qperret@google.com>,
	Michael Roth <michael.roth@amd.com>, Wang <wei.w.wang@intel.com>,
	Liam Merwick <liam.merwick@oracle.com>,
	Isaku Yamahata <isaku.yamahata@gmail.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>, Jorg Rodel <jroedel@suse.de>,
	Vitaly Kuznetsov <vkuznets@redhat.com>
Subject: Re: [RFC PATCH v11 00/29]  KVM: guest_memfd() and per-page attributes
Date: Fri, 25 Aug 2023 10:47:12 -0700	[thread overview]
Message-ID: <ZOjpIL0SFH+E3Dj4@google.com> (raw)
In-Reply-To: <20230718234512.1690985-1-seanjc@google.com>

On Tue, Jul 18, 2023, Sean Christopherson wrote:
> This is the next iteration of implementing fd-based (instead of vma-based)
> memory for KVM guests.  If you want the full background of why we are doing
> this, please go read the v10 cover letter[1].
> 
> The biggest change from v10 is to implement the backing storage in KVM
> itself, and expose it via a KVM ioctl() instead of a "generic" sycall.
> See link[2] for details on why we pivoted to a KVM-specific approach.
> 
> Key word is "biggest".  Relative to v10, there are many big changes.
> Highlights below (I can't remember everything that got changed at
> this point).
> 
> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
> documentation.  And ideally, we'll have even more tests before merging.
> There are also several gaps/opens (to be discussed in tomorrow's PUCK).
> 
> v11:
>  - Test private<=>shared conversions *without* doing fallocate()
>  - PUNCH_HOLE all memory between iterations of the conversion test so that
>    KVM doesn't retain pages in the guest_memfd
>  - Rename hugepage control to be a very generic ALLOW_HUGEPAGE, instead of
>    giving it a THP or PMD specific name.
>  - Fold in fixes from a lot of people (thank you!)
>  - Zap SPTEs *before* updating attributes to ensure no weirdness, e.g. if
>    KVM handles a page fault and looks at inconsistent attributes
>  - Refactor MMU interaction with attributes updates to reuse much of KVM's
>    framework for mmu_notifiers.
> 
> [1] https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
> [2] https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com

Trimmed the Cc substantially to discuss what needs to be done (within our control)
to have a chance of landing this "soon".

We've chipped away at the todo list a bit, but there are still several non-trivial
things that need to get addressed before we can merge guest_memfd().  If we move
*really* fast, e.g. get everything address in less than 3 weeks, we have an outside
chance at hitting 6.7.  But honestly, I think it 6.7 is extremely unlikely.

For 6.8, we'll be in good shape if we can get a non-RFC posted in the next ~6
weeks, i.e. by end of September, though obviously the sooner the better.  If we slip
much beyond that, 6.8 is going to be tough due to people disappearing for year-end
stuff and holidays, i.e. we won't have enough time to address feedback _and_ get a
another round of reviews.

Speaking purely from a personal perspective, I really, really want to hit 6.8 so
that this doesn't drag into 2024.

Loosely ordered by size and urgency (bigger, more urgent stuff first).  Please
holler if I've missed something (taking notes is my Achilles heel).


Filemap vs. xarray
------------------
This is the main item that needs attention.  I don't want to merge guest_memfd()
without doing this comparison, as not using filemap means we don't need AS_UNMOVABLE.
Arguably we could merge a filemap implementation without AS_UNMOVABLE and just eat
the suboptimal behavior, but not waiting a little while longer to do everything we
can to get this right the first time seems ridiculous after we've been working on
this for literally years.

Paolo was going to work on an axarray implementation, but AFAIK he hasn't done
anything yet.  We (Google) don't have anyone available to work on an xarray
implementation for several weeks (at best), so if anyone has the bandwidth and
desire to take stab at an xarray implementation, please speak up.


kvm_gmem_error_page()
---------------------
As pointed out by Vishal[*], guest_memfd()'s error/poison handling is garbage.
KVM needs to unmap, check for poison, and probably also restrict the allowed
mapping size if a partial page is poisoned.

This item also needs actually testing, e.g. via error injection.  Writing a
proper selftest may not be feasible, but at a bare minimum, someone needs to
manually verify an error on a guest_memfd() can get routed all the way into the
guest, e.g. as an #MC on x86.

This needs an owner.  I'm guessing 2-3 weeks?  Though I tend to be overly
optimistic when sizing these things...

[*] https://lore.kernel.org/all/CAGtprH9a2jX-hdww9GPuMrO9noNeXkoqE8oejtVn2vD0AZa3zA@mail.gmail.com


Documentation
-------------
Obviously a must have.  AFAIK, no one is "officially" signed up to work on this.
I honestly haven't looked at the document in recent versions, so I have no idea
how much effort is required.


Fully anonymous inode vs. proper filesystem
-------------------------------------------
This is another one that needs to get sorted out before merging, but it should
be a much smaller task (a day or two).  I will get to this in a few weeks unless
someone beats me to the punch.


KVM_CAP_GUEST_MEMFD
-------------------
New ioctl() needs a new cap.  Trivial, just capturing here so I don't forget.


Changelogs
----------
This one is on me, though I will probably procrastinate until all the other todo
items are close to being finished.


Tests
-----
I would really like to have a test that verifies KVM actually installs hugepages,
but I'm ok merging without such a test, mainly because I suspect it will be
annoyingly difficult to end up with a test that isn't flaky.

Beyond that, and the aforementioned memory poisoining, IMO, we have enough test
coverage.  I am always open to more tests, but I don't think adding more coverage
is a must have for merging.


.release_folio and .invalidate_folio versus .evict_inode
--------------------------------------------------------
I think we're good on this one?  IIRC, without a need to "clean" physical memory
(SNP and TDX), we don't need to do anything special.

Mike or Ackerley, am I forgetting anything?


NUMA
----
I am completely comfortable doing nothing as part of the initial merge.  We have
line of sight to supporting NUMA policies in the form of fbind(), and I would be
quite surprised if we get much pushback on implementing fbind().


RSS stats
---------
My preference is to not do anything in the initial implementation, and defer any
changes until later.  IMO, while odd, not capturing guest_memfd() in RSS is
acceptable as there are no virtual mappings to account.  I completely agree that
we would ideally surface the memory usage to userspace in some way, but I don't
think it's so critical that it needs to happen as part of the initial merge.


Intrahost migration support
---------------------------
Ackerley's RFC[*] is enough for me to have confidence that we can support
intrahost migration without having to rework the ABI.  Please holler if you
disagree.

[*] https://lkml.kernel.org/r/cover.1691446946.git.ackerleytng%40google.com


  parent reply	other threads:[~2023-08-25 17:48 UTC|newest]

Thread overview: 140+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-18 23:44 [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes Sean Christopherson
2023-07-18 23:44 ` [RFC PATCH v11 01/29] KVM: Wrap kvm_gfn_range.pte in a per-action union Sean Christopherson
2023-07-19 13:39   ` Jarkko Sakkinen
2023-07-19 15:39     ` Sean Christopherson
2023-07-19 16:55   ` Paolo Bonzini
2023-07-26 20:22     ` Sean Christopherson
2023-07-21  6:26   ` Yan Zhao
2023-07-21 10:45     ` Xu Yilun
2023-07-25 18:05       ` Sean Christopherson
2023-07-18 23:44 ` [RFC PATCH v11 02/29] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
2023-07-19 17:12   ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 03/29] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
2023-07-19 17:12   ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 04/29] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER Sean Christopherson
2023-07-19 17:34   ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 05/29] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER Sean Christopherson
2023-07-19  7:31   ` Yuan Yao
2023-07-19 14:15     ` Sean Christopherson
2023-07-20  1:15       ` Yuan Yao
2023-07-18 23:44 ` [RFC PATCH v11 06/29] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
2023-07-21  9:03   ` Paolo Bonzini
2023-07-28  9:25   ` Quentin Perret
2023-07-29  0:03     ` Sean Christopherson
2023-07-31  9:30       ` Quentin Perret
2023-07-31 15:58       ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 07/29] KVM: Add KVM_EXIT_MEMORY_FAULT exit Sean Christopherson
2023-07-19  7:54   ` Yuan Yao
2023-07-19 14:16     ` Sean Christopherson
2023-07-18 23:44 ` [RFC PATCH v11 08/29] KVM: Introduce per-page memory attributes Sean Christopherson
2023-07-20  8:09   ` Yuan Yao
2023-07-20 19:02     ` Isaku Yamahata
2023-07-20 20:20       ` Sean Christopherson
2023-07-21 10:57   ` Paolo Bonzini
2023-07-21 15:56   ` Xiaoyao Li
2023-07-24  4:43   ` Xu Yilun
2023-07-26 15:59     ` Sean Christopherson
2023-07-27  3:24       ` Xu Yilun
2023-08-02 20:31   ` Isaku Yamahata
2023-08-14  0:44   ` Binbin Wu
2023-08-14 21:54     ` Sean Christopherson
2023-07-18 23:44 ` [RFC PATCH v11 09/29] KVM: x86: Disallow hugepages when memory attributes are mixed Sean Christopherson
2023-07-21 11:59   ` Paolo Bonzini
2023-07-21 17:41     ` Sean Christopherson
2023-07-18 23:44 ` [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable Sean Christopherson
2023-07-25 10:24   ` Kirill A . Shutemov
2023-07-25 12:51     ` Matthew Wilcox
2023-07-26 11:36       ` Kirill A . Shutemov
2023-07-28 16:02       ` Vlastimil Babka
2023-07-28 16:13         ` Paolo Bonzini
2023-09-01  8:23       ` Vlastimil Babka
2023-07-18 23:44 ` [RFC PATCH v11 11/29] security: Export security_inode_init_security_anon() for use by KVM Sean Christopherson
2023-07-19  2:14   ` Paul Moore
2023-07-31 10:46   ` Vlastimil Babka
2023-07-18 23:44 ` [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
2023-07-19 17:21   ` Vishal Annapurve
2023-07-19 17:47     ` Sean Christopherson
2023-07-20 14:45   ` Xiaoyao Li
2023-07-20 15:14     ` Sean Christopherson
2023-07-20 21:28   ` Isaku Yamahata
2023-07-21  6:13   ` Yuan Yao
2023-07-21 22:27     ` Isaku Yamahata
2023-07-21 22:33       ` Sean Christopherson
2023-07-21 15:05   ` Xiaoyao Li
2023-07-21 15:42     ` Xiaoyao Li
2023-07-21 17:42       ` Sean Christopherson
2023-07-21 17:17   ` Paolo Bonzini
2023-07-21 17:50     ` Sean Christopherson
2023-07-25 15:09   ` Wang, Wei W
2023-07-25 16:03     ` Sean Christopherson
2023-07-26  1:51       ` Wang, Wei W
2023-07-31 16:23       ` Fuad Tabba
2023-07-26 17:18   ` Elliot Berman
2023-07-26 19:28     ` Sean Christopherson
2023-07-27 10:39   ` Fuad Tabba
2023-07-27 17:13     ` Sean Christopherson
2023-07-31 13:46       ` Fuad Tabba
2023-08-03 19:15   ` Ryan Afranji
2023-08-07 23:06   ` Ackerley Tng
2023-08-08 21:13     ` Sean Christopherson
2023-08-10 23:57       ` Vishal Annapurve
2023-08-11 17:44         ` Sean Christopherson
2023-08-15 18:43       ` Ackerley Tng
2023-08-15 20:03         ` Sean Christopherson
2023-08-21 17:30           ` Ackerley Tng
2023-08-21 19:33             ` Sean Christopherson
2023-08-28 22:56               ` Ackerley Tng
2023-08-29  2:53                 ` Elliot Berman
2023-09-14 19:12                   ` Sean Christopherson
2023-09-14 18:15                 ` Sean Christopherson
2023-09-14 23:19                   ` Ackerley Tng
2023-09-15  0:33                     ` Sean Christopherson
2023-08-30 15:12   ` Binbin Wu
2023-08-30 16:44     ` Ackerley Tng
2023-09-01  3:45       ` Binbin Wu
2023-09-01 16:46         ` Ackerley Tng
2023-07-18 23:44 ` [RFC PATCH v11 13/29] KVM: Add transparent hugepage support for dedicated guest memory Sean Christopherson
2023-07-21 15:07   ` Paolo Bonzini
2023-07-21 17:13     ` Sean Christopherson
2023-09-06 22:10       ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 14/29] KVM: x86/mmu: Handle page fault for private memory Sean Christopherson
2023-07-21 15:09   ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 15/29] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro Sean Christopherson
2023-07-21 15:07   ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 16/29] KVM: Allow arch code to track number of memslot address spaces per VM Sean Christopherson
2023-07-21 15:12   ` Paolo Bonzini
2023-07-18 23:45 ` [RFC PATCH v11 17/29] KVM: x86: Add support for "protected VMs" that can utilize private memory Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 18/29] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper Sean Christopherson
2023-07-21 15:14   ` Paolo Bonzini
2023-07-18 23:45 ` [RFC PATCH v11 19/29] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 20/29] KVM: selftests: Add support for creating private memslots Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 21/29] KVM: selftests: Add helpers to convert guest memory b/w private and shared Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 22/29] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86) Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 23/29] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 24/29] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 25/29] KVM: selftests: Add x86-only selftest for private memory conversions Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 26/29] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 27/29] KVM: selftests: Expand set_memory_region_test to validate guest_memfd() Sean Christopherson
2023-08-07 23:17   ` Ackerley Tng
2023-07-18 23:45 ` [RFC PATCH v11 28/29] KVM: selftests: Add basic selftest for guest_memfd() Sean Christopherson
2023-08-07 23:20   ` Ackerley Tng
2023-08-18 23:03     ` Sean Christopherson
2023-08-07 23:25   ` Ackerley Tng
2023-08-18 23:01     ` Sean Christopherson
2023-08-21 19:49       ` Ackerley Tng
2023-07-18 23:45 ` [RFC PATCH v11 29/29] KVM: selftests: Test KVM exit behavior for private memory/access Sean Christopherson
2023-07-24  6:38 ` [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes Nikunj A. Dadhania
2023-07-24 17:00   ` Sean Christopherson
2023-07-26 11:20     ` Nikunj A. Dadhania
2023-07-26 14:24       ` Sean Christopherson
2023-07-27  6:42         ` Nikunj A. Dadhania
2023-08-03 11:03       ` Vlastimil Babka
2023-07-24 20:16 ` Sean Christopherson
2023-08-25 17:47 ` Sean Christopherson [this message]
2023-08-29  9:12   ` Chao Peng
2023-08-31 18:29     ` Sean Christopherson
2023-09-01  1:17       ` Chao Peng
2023-09-01  8:26         ` Vlastimil Babka
2023-09-01  9:10         ` Paolo Bonzini
2023-08-30  0:00   ` Isaku Yamahata
2023-09-09  0:16   ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZOjpIL0SFH+E3Dj4@google.com \
    --to=seanjc@google.com \
    --cc=ackerleytng@google.com \
    --cc=chao.p.peng@linux.intel.com \
    --cc=isaku.yamahata@gmail.com \
    --cc=jroedel@suse.de \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=liam.merwick@oracle.com \
    --cc=mail@maciej.szmigiero.name \
    --cc=michael.roth@amd.com \
    --cc=pbonzini@redhat.com \
    --cc=qperret@google.com \
    --cc=tabba@google.com \
    --cc=vannapurve@google.com \
    --cc=vbabka@suse.cz \
    --cc=vkuznets@redhat.com \
    --cc=wei.w.wang@intel.com \
    --cc=yu.c.zhang@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).