linux-api.vger.kernel.org archive mirror
* Re: [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private memory
       [not found] <20220118132121.31388-1-chao.p.peng@linux.intel.com>
@ 2022-02-08 18:33 ` Mike Rapoport
  2022-02-17 13:47   ` Chao Peng
       [not found] ` <20220118132121.31388-2-chao.p.peng@linux.intel.com>
  1 sibling, 1 reply; 10+ messages in thread
From: Mike Rapoport @ 2022-02-08 18:33 UTC (permalink / raw)
  To: Chao Peng
  Cc: linux-api, kvm, linux-kernel, linux-mm, linux-fsdevel,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david

(added linux-api)

On Tue, Jan 18, 2022 at 09:21:09PM +0800, Chao Peng wrote:
> This is v4 of this series, which tries to implement fd-based KVM
> guest private memory. The patches are based on the latest kvm/queue
> branch commit:
> 
>   fea31d169094 KVM: x86/pmu: Fix available_event_types check for
>                REF_CPU_CYCLES event
> 
> Introduction
> ------------
> In general, this patch series introduces an fd-based memslot which
> provides guest memory through a memory file descriptor fd[offset,size]
> instead of hva/size. The fd can be created from a supported memory
> filesystem like tmpfs/hugetlbfs etc., which we refer to as the memory
> backing store. KVM and the memory backing store exchange callbacks
> when such a memslot gets created. At runtime KVM calls into the
> callbacks provided by the backing store to get the pfn for a given
> fd+offset. The memory backing store also calls into KVM callbacks when
> userspace fallocates/punches a hole in the fd, to notify KVM to
> map/unmap secondary MMU page tables.
> 
> Compared to the existing hva-based memslot, this new type of memslot
> allows guest memory to be unmapped from host userspace (e.g. QEMU) and
> even the kernel itself, thereby reducing the attack surface and
> preventing bugs.
> 
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd, and KVM can use a single memslot to hold
> both the private and shared parts of the guest memory.
> 
> mm extension
> ------------
> Introduces a new F_SEAL_INACCESSIBLE seal for shmem and a new
> MFD_INACCESSIBLE flag for memfd_create(). A file created with these
> flags cannot be read(), written or mmap()-ed via normal MMU
> operations. The file content can only be used with the newly
> introduced memfile_notifier extension.

It would be great to see a man page draft for the new ABI flags.
 
> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>   - memfile_notifier_ops: callbacks for memory backing store to notify
>     KVM when memory gets allocated/invalidated.
>   - memfile_pfn_ops: callbacks for KVM to call into memory backing store
>     to request memory pages for guest private memory.
> 
> memslot extension
> -----------------
> Add the private fd and the fd offset to the existing 'shared' memslot
> so that both private and shared guest memory can live in one single
> memslot. A page in the memslot is either private or shared. A page is
> private only when it's already allocated in the backing store fd; in
> all other cases it's treated as shared, which includes pages already
> mapped as shared as well as pages that have not been mapped yet. This
> means the memory backing store is the source of truth for which pages
> are private.
> 
> Private memory map/unmap and conversion
> ---------------------------------------
> Userspace's map/unmap operations are done via fallocate() calls on the
> backing store fd.
>   - map: default fallocate() with mode=0.
>   - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> Map/unmap triggers the memfile_notifier_ops above to let KVM map/unmap
> the secondary MMU page tables.
> 
> Test
> ----
> To test the new functionality of this series, the TDX patchset is
> needed. Since the TDX patchset has not been merged yet, I did two
> kinds of tests:
> 
> -  Regression test on kvm/queue (this series)
>    Most of the new code is not covered; I only tested building and
>    booting.
> 
> -  New functional test on the latest TDX code
>    The series is rebased onto the latest TDX code and the new
>    functionality is tested.
> 
> For TDX test please see below repos:
> Linux: https://github.com/chao-p/linux/tree/privmem-v4.3
> QEMU: https://github.com/chao-p/qemu/tree/privmem-v4
> 
> And an example QEMU command line:
> -object tdx-guest,id=tdx \
> -object memory-backend-memfd-private,id=ram1,size=2G \
> -machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1
> 
> Changelog
> ----------
> v4:
>   - Decoupled the callbacks between KVM/mm from memfd and use new
>     name 'memfile_notifier'.
>   - Supported registering multiple memslots to the same backing store.
>   - Added per-memslot pfn_ops instead of per-system.
>   - Reworked the invalidation part.
>   - Improved new KVM uAPIs (private memslot extension and memory
>     error) per Sean's suggestions.
>   - Addressed many other minor fixes for comments from v3.
> v3:
>   - Added locking protection when calling
>     invalidate_page_range/fallocate callbacks.
>   - Changed the memslot structure to keep using useraddr for shared memory.
>   - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
>   - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
>   - Commit message improvement.
>   - Many small fixes for comments from the last version.
> 
> Links of previous discussions
> -----------------------------
> [1] Original design proposal:
> https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
> [2] Updated proposal and RFC patch v1:
> https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
> [3] Patch v3: https://lkml.org/lkml/2021/12/23/283
> 
> Chao Peng (11):
>   mm/memfd: Introduce MFD_INACCESSIBLE flag
>   mm: Introduce memfile_notifier
>   mm/shmem: Support memfile_notifier
>   KVM: Extend the memslot to support fd-based private memory
>   KVM: Use kvm_userspace_memory_region_ext
>   KVM: Add KVM_EXIT_MEMORY_ERROR exit
>   KVM: Use memfile_pfn_ops to obtain pfn for private pages
>   KVM: Handle page fault for private memory
>   KVM: Register private memslot to memory backing store
>   KVM: Zap existing KVM mappings when pages changed in the private fd
>   KVM: Expose KVM_MEM_PRIVATE
> 
> Kirill A. Shutemov (1):
>   mm/shmem: Introduce F_SEAL_INACCESSIBLE
> 
>  arch/x86/kvm/Kconfig             |   1 +
>  arch/x86/kvm/mmu/mmu.c           |  73 +++++++++++-
>  arch/x86/kvm/mmu/paging_tmpl.h   |  11 +-
>  arch/x86/kvm/x86.c               |  12 +-
>  include/linux/kvm_host.h         |  49 +++++++-
>  include/linux/memfile_notifier.h |  53 +++++++++
>  include/linux/shmem_fs.h         |   4 +
>  include/uapi/linux/fcntl.h       |   1 +
>  include/uapi/linux/kvm.h         |  17 +++
>  include/uapi/linux/memfd.h       |   1 +
>  mm/Kconfig                       |   4 +
>  mm/Makefile                      |   1 +
>  mm/memfd.c                       |  20 +++-
>  mm/memfile_notifier.c            |  99 ++++++++++++++++
>  mm/shmem.c                       | 121 +++++++++++++++++++-
>  virt/kvm/kvm_main.c              | 188 +++++++++++++++++++++++++++----
>  16 files changed, 614 insertions(+), 41 deletions(-)
>  create mode 100644 include/linux/memfile_notifier.h
>  create mode 100644 mm/memfile_notifier.c
> 
> -- 
> 2.17.1
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
       [not found] ` <20220118132121.31388-2-chao.p.peng@linux.intel.com>
@ 2022-02-11 23:33   ` Andy Lutomirski
  2022-02-17 13:06     ` Chao Peng
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Lutomirski @ 2022-02-11 23:33 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	qemu-devel, Linux API
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Kirill A . Shutemov, jun.nakajima,
	dave.hansen, ak, david

On 1/18/22 05:21, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> the file is inaccessible from userspace through ordinary MMU access
> (e.g., read/write/mmap). However, the file content can be accessed
> via a different mechanism (e.g. KVM MMU) indirectly.
> 
> It provides the semantics required for KVM guest private memory
> support: a file descriptor with this seal set is going to be used as
> the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV, but may not be accessible from host userspace.
> 
> At this time only shmem implements this seal.
> 

I don't dislike this *that* much, but I do dislike this. 
F_SEAL_INACCESSIBLE essentially transmutes a memfd into a different type 
of object.  While this can apparently be done successfully and without 
races (as in this code), it's at least awkward.  I think that either 
creating a special inaccessible memfd should be a single operation that 
creates the correct type of object, or there should be a clear 
justification for why it's a two-step process.

(Imagine if the way to create an eventfd would be to call 
timerfd_create() and then do a special fcntl to turn it into an eventfd 
but only if it's not currently armed.  This would be weird.)


* Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
  2022-02-11 23:33   ` [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE Andy Lutomirski
@ 2022-02-17 13:06     ` Chao Peng
  2022-02-17 19:09       ` Andy Lutomirski
  0 siblings, 1 reply; 10+ messages in thread
From: Chao Peng @ 2022-02-17 13:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel,
	Linux API, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Kirill A . Shutemov, jun.nakajima,
	dave.hansen, ak, david

On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
> On 1/18/22 05:21, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> > the file is inaccessible from userspace through ordinary MMU access
> > (e.g., read/write/mmap). However, the file content can be accessed
> > via a different mechanism (e.g. KVM MMU) indirectly.
> > 
> > It provides the semantics required for KVM guest private memory
> > support: a file descriptor with this seal set is going to be used as
> > the source of guest memory in confidential computing environments
> > such as Intel TDX/AMD SEV, but may not be accessible from host
> > userspace.
> > 
> > At this time only shmem implements this seal.
> > 
> 
> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
> essentially transmutes a memfd into a different type of object.  While this
> can apparently be done successfully and without races (as in this code),
> it's at least awkward.  I think that either creating a special inaccessible
> memfd should be a single operation that creates the correct type of object, or
> there should be a clear justification for why it's a two-step process.

Now one justification may come from Steven's comment on patch 00: for
the ARM usage, one can create a normal memfd, (partially) populate it
with the initial guest memory content (e.g. firmware), and then
F_SEAL_INACCESSIBLE it just before the first launch of the guest in
KVM (the current code would definitely need to be changed to support
that).

Thanks,
Chao
> 
> (Imagine if the way to create an eventfd would be to call timerfd_create()
> and then do a special fcntl to turn it into an eventfd but only if it's not
> currently armed.  This would be weird.)


* Re: [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-02-08 18:33 ` [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private memory Mike Rapoport
@ 2022-02-17 13:47   ` Chao Peng
  0 siblings, 0 replies; 10+ messages in thread
From: Chao Peng @ 2022-02-17 13:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-api, kvm, linux-kernel, linux-mm, linux-fsdevel,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david

On Tue, Feb 08, 2022 at 08:33:18PM +0200, Mike Rapoport wrote:
> (added linux-api)
> 
> On Tue, Jan 18, 2022 at 09:21:09PM +0800, Chao Peng wrote:
> > This is v4 of this series, which tries to implement fd-based KVM
> > guest private memory. The patches are based on the latest kvm/queue
> > branch commit:
> > 
> >   fea31d169094 KVM: x86/pmu: Fix available_event_types check for
> >                REF_CPU_CYCLES event
> > 
> > Introduction
> > ------------
> > In general, this patch series introduces an fd-based memslot which
> > provides guest memory through a memory file descriptor
> > fd[offset,size] instead of hva/size. The fd can be created from a
> > supported memory filesystem like tmpfs/hugetlbfs etc., which we
> > refer to as the memory backing store. KVM and the memory backing
> > store exchange callbacks when such a memslot gets created. At
> > runtime KVM calls into the callbacks provided by the backing store
> > to get the pfn for a given fd+offset. The memory backing store also
> > calls into KVM callbacks when userspace fallocates/punches a hole in
> > the fd, to notify KVM to map/unmap secondary MMU page tables.
> > 
> > Compared to the existing hva-based memslot, this new type of memslot
> > allows guest memory to be unmapped from host userspace (e.g. QEMU)
> > and even the kernel itself, thereby reducing the attack surface and
> > preventing bugs.
> > 
> > Based on this fd-based memslot, we can build guest private memory that
> > is going to be used in confidential computing environments such as Intel
> > TDX and AMD SEV. When supported, the memory backing store can provide
> > more enforcement on the fd, and KVM can use a single memslot to hold
> > both the private and shared parts of the guest memory.
> > 
> > mm extension
> > ------------
> > Introduces a new F_SEAL_INACCESSIBLE seal for shmem and a new
> > MFD_INACCESSIBLE flag for memfd_create(). A file created with these
> > flags cannot be read(), written or mmap()-ed via normal MMU
> > operations. The file content can only be used with the newly
> > introduced memfile_notifier extension.
> 
> It would be great to see a man page draft for the new ABI flags.

Yes I can provide the man page.

Thanks,
Chao


* Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
  2022-02-17 13:06     ` Chao Peng
@ 2022-02-17 19:09       ` Andy Lutomirski
  2022-02-23 11:49         ` Chao Peng
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Lutomirski @ 2022-02-17 19:09 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm list, Linux Kernel Mailing List, linux-mm, linux-fsdevel,
	qemu-devel, Linux API, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Yu Zhang,
	Kirill A. Shutemov, Nakajima, Jun, Dave Hansen, Andi Kleen,
	David Hildenbrand

On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
> On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
>> On 1/18/22 05:21, Chao Peng wrote:
>> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> > 
>> > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
>> > the file is inaccessible from userspace through ordinary MMU access
>> > (e.g., read/write/mmap). However, the file content can be accessed
>> > via a different mechanism (e.g. KVM MMU) indirectly.
>> > 
>> > It provides the semantics required for KVM guest private memory
>> > support: a file descriptor with this seal set is going to be used
>> > as the source of guest memory in confidential computing
>> > environments such as Intel TDX/AMD SEV, but may not be accessible
>> > from host userspace.
>> > 
>> > At this time only shmem implements this seal.
>> > 
>> 
>> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
>> essentially transmutes a memfd into a different type of object.  While this
>> can apparently be done successfully and without races (as in this code),
>> it's at least awkward.  I think that either creating a special inaccessible
>> memfd should be a single operation that creates the correct type of object, or
>> there should be a clear justification for why it's a two-step process.
>
> Now one justification may come from Steven's comment on patch 00: for
> the ARM usage, one can create a normal memfd, (partially) populate it
> with the initial guest memory content (e.g. firmware), and then
> F_SEAL_INACCESSIBLE it just before the first launch of the guest in
> KVM (the current code would definitely need to be changed to support
> that).

Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right?  So this won't work.

In any case, the whole confidential VM initialization story is a bit muddy.  From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it.  From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM.  These are fundamentally not the same thing even if they accomplish the same end goal.  For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.

Also, if we ever get fancy and teach the page allocator about memory with reduced directmap permissions, it may well be more efficient for userspace to shove data into a memfd via ioctl than it is to mmap it and write the data.


* Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
  2022-02-17 19:09       ` Andy Lutomirski
@ 2022-02-23 11:49         ` Chao Peng
  2022-02-23 12:05           ` Steven Price
  0 siblings, 1 reply; 10+ messages in thread
From: Chao Peng @ 2022-02-23 11:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: kvm list, Linux Kernel Mailing List, linux-mm, linux-fsdevel,
	qemu-devel, Linux API, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Yu Zhang,
	Kirill A. Shutemov, Nakajima, Jun, Dave Hansen, Andi Kleen,
	David Hildenbrand, Steven Price

On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:
> On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
> > On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
> >> On 1/18/22 05:21, Chao Peng wrote:
> >> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >> > 
> >> > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> >> > the file is inaccessible from userspace through ordinary MMU access
> >> > (e.g., read/write/mmap). However, the file content can be accessed
> >> > via a different mechanism (e.g. KVM MMU) indirectly.
> >> > 
> >> > It provides the semantics required for KVM guest private memory
> >> > support: a file descriptor with this seal set is going to be used
> >> > as the source of guest memory in confidential computing
> >> > environments such as Intel TDX/AMD SEV, but may not be accessible
> >> > from host userspace.
> >> > 
> >> > At this time only shmem implements this seal.
> >> > 
> >> 
> >> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
> >> essentially transmutes a memfd into a different type of object.  While this
> >> can apparently be done successfully and without races (as in this code),
> >> it's at least awkward.  I think that either creating a special inaccessible
> >> memfd should be a single operation that creates the correct type of object, or
> >> there should be a clear justification for why it's a two-step process.
> >
> > Now one justification may come from Steven's comment on patch 00:
> > for the ARM usage, one can create a normal memfd, (partially)
> > populate it with the initial guest memory content (e.g. firmware),
> > and then F_SEAL_INACCESSIBLE it just before the first launch of the
> > guest in KVM (the current code would definitely need to be changed
> > to support that).
> 
> Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right?  So this won't work.

Hmm, right. If we set F_SEAL_INACCESSIBLE on a non-empty file, we would
need to make sure access to existing mmap-ed areas is prevented, but
that is hard.

> 
> In any case, the whole confidential VM initialization story is a bit muddy.  From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it.  From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM.  These are fundamentally not the same thing even if they accomplish the same end goal.  For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.

Yes, TDX requires an ioctl. Steven may comment on the ARM part.

Chao
> 
> Also, if we ever get fancy and teach the page allocator about memory with reduced directmap permissions, it may well be more efficient for userspace to shove data into a memfd via ioctl than it is to mmap it and write the data.





* Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
  2022-02-23 11:49         ` Chao Peng
@ 2022-02-23 12:05           ` Steven Price
  2022-03-04 19:24             ` Andy Lutomirski
  0 siblings, 1 reply; 10+ messages in thread
From: Steven Price @ 2022-02-23 12:05 UTC (permalink / raw)
  To: Chao Peng, Andy Lutomirski
  Cc: kvm list, Linux Kernel Mailing List, linux-mm, linux-fsdevel,
	qemu-devel, Linux API, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Yu Zhang,
	Kirill A. Shutemov, Nakajima, Jun, Dave Hansen, Andi Kleen,
	David Hildenbrand

On 23/02/2022 11:49, Chao Peng wrote:
> On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:
>> On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
>>> On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
>>>> On 1/18/22 05:21, Chao Peng wrote:
>>>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>>>
>>>>> Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
>>>>> the file is inaccessible from userspace through ordinary MMU access
>>>>> (e.g., read/write/mmap). However, the file content can be accessed
>>>>> via a different mechanism (e.g. KVM MMU) indirectly.
>>>>>
>>>>> It provides the semantics required for KVM guest private memory
>>>>> support: a file descriptor with this seal set is going to be used
>>>>> as the source of guest memory in confidential computing
>>>>> environments such as Intel TDX/AMD SEV, but may not be accessible
>>>>> from host userspace.
>>>>>
>>>>> At this time only shmem implements this seal.
>>>>>
>>>>
>>>> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
>>>> essentially transmutes a memfd into a different type of object.  While this
>>>> can apparently be done successfully and without races (as in this code),
>>>> it's at least awkward.  I think that either creating a special inaccessible
>>>> memfd should be a single operation that creates the correct type of object, or
>>>> there should be a clear justification for why it's a two-step process.
>>>
>>> Now one justification may come from Steven's comment on patch 00:
>>> for the ARM usage, one can create a normal memfd, (partially)
>>> populate it with the initial guest memory content (e.g. firmware),
>>> and then F_SEAL_INACCESSIBLE it just before the first launch of the
>>> guest in KVM (the current code would definitely need to be changed
>>> to support that).
>>
>> Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right?  So this won't work.
> 
> Hmm, right. If we set F_SEAL_INACCESSIBLE on a non-empty file, we
> would need to make sure access to existing mmap-ed areas is prevented,
> but that is hard.
> 
>>
>> In any case, the whole confidential VM initialization story is a bit muddy.  From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it.  From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM.  These are fundamentally not the same thing even if they accomplish the same end goal.  For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.
> 
> Yes, TDX requires an ioctl. Steven may comment on the ARM part.

The Arm story is evolving so I can't give a definite answer yet. Our
current prototyping works by creating the initial VM content in a
memslot as with a normal VM and then calling an ioctl which throws the
big switch and converts all the (populated) pages to be protected. At
this point the RMM performs a measurement of the data that the VM is
being populated with.

The above (in our prototype) suffers from all the expected problems with
a malicious VMM being able to trick the host kernel into accessing those
pages after they have been protected (causing a fault detected by the
hardware).

The ideal (from our perspective) approach would be to follow the same
flow but where the VMM populates a memfd rather than normal anonymous
pages. The memfd could then be sealed and the pages converted to
protected ones (with the RMM measuring them in the process).

The question becomes: how is that memfd populated? It would be nice if
that could be done using normal operations on a memfd (i.e. using
mmap()) and therefore this code could be (relatively) portable. This
would mean that any pages mapped from the memfd would either need to
block the sealing or be revoked at the time of sealing.

The other approach is we could of course implement a special ioctl which
effectively does a memcpy into the (created empty and sealed) memfd and
does the necessary dance with the RMM to measure the contents. This
would match the "transcript of the series of operations" described above
- but seems much less ideal from the viewpoint of the VMM.

Steve

> Chao
>>
>> Also, if we ever get fancy and teach the page allocator about memory with reduced directmap permissions, it may well be more efficient for userspace to shove data into a memfd via ioctl than it is to mmap it and write the data.
> 
> 
> 



* Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
  2022-02-23 12:05           ` Steven Price
@ 2022-03-04 19:24             ` Andy Lutomirski
  2022-03-07 13:26               ` Chao Peng
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Lutomirski @ 2022-03-04 19:24 UTC (permalink / raw)
  To: Steven Price, Chao Peng
  Cc: kvm list, Linux Kernel Mailing List, linux-mm, linux-fsdevel,
	qemu-devel, Linux API, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Yu Zhang,
	Kirill A. Shutemov, Nakajima, Jun, Dave Hansen, Andi Kleen,
	David Hildenbrand

On 2/23/22 04:05, Steven Price wrote:
> On 23/02/2022 11:49, Chao Peng wrote:
>> On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:
>>> On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
>>>> On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
>>>>> On 1/18/22 05:21, Chao Peng wrote:
>>>>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>>>>
>>>>>> Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
>>>>>> the file is inaccessible from userspace through ordinary MMU access
>>>>>> (e.g., read/write/mmap). However, the file content can be accessed
>>>>>> via a different mechanism (e.g. KVM MMU) indirectly.
>>>>>>
>>>>>> It provides the semantics required for KVM guest private memory
>>>>>> support: a file descriptor with this seal set is going to be used
>>>>>> as the source of guest memory in confidential computing
>>>>>> environments such as Intel TDX/AMD SEV, but may not be accessible
>>>>>> from host userspace.
>>>>>>
>>>>>> At this time only shmem implements this seal.
>>>>>>
>>>>>
>>>>> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
>>>>> essentially transmutes a memfd into a different type of object.  While this
>>>>> can apparently be done successfully and without races (as in this code),
>>>>> it's at least awkward.  I think that either creating a special inaccessible
>>>>> memfd should be a single operation that creates the correct type of object, or
>>>>> there should be a clear justification for why it's a two-step process.
>>>>
>>>> Now one justification may come from Steven's comment on patch 00:
>>>> for the ARM usage, one can create a normal memfd, (partially)
>>>> populate it with the initial guest memory content (e.g. firmware),
>>>> and then F_SEAL_INACCESSIBLE it just before the first launch of the
>>>> guest in KVM (the current code would definitely need to be changed
>>>> to support that).
>>>
>>> Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right?  So this won't work.
>>
>> Hmm, right. If we set F_SEAL_INACCESSIBLE on a non-empty file, we
>> would need to make sure access to existing mmap-ed areas is
>> prevented, but that is hard.
>>
>>>
>>> In any case, the whole confidential VM initialization story is a bit muddy.  From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it.  From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM.  These are fundamentally not the same thing even if they accomplish the same end goal.  For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.
>>
>> Yes, TDX requires an ioctl. Steven may comment on the ARM part.
> 
> The Arm story is evolving so I can't give a definite answer yet. Our
> current prototyping works by creating the initial VM content in a
> memslot as with a normal VM and then calling an ioctl which throws the
> big switch and converts all the (populated) pages to be protected. At
> this point the RMM performs a measurement of the data that the VM is
> being populated with.
> 
> The above (in our prototype) suffers from all the expected problems with
> a malicious VMM being able to trick the host kernel into accessing those
> pages after they have been protected (causing a fault detected by the
> hardware).
> 
> The ideal (from our perspective) approach would be to follow the same
> flow but where the VMM populates a memfd rather than normal anonymous
> pages. The memfd could then be sealed and the pages converted to
> protected ones (with the RMM measuring them in the process).
> 
> The question becomes how is that memfd populated? It would be nice if
> that could be done using normal operations on a memfd (i.e. using
> mmap()) and therefore this code could be (relatively) portable. This
> would mean that any pages mapped from the memfd would either need to
> block the sealing or be revoked at the time of sealing.
> 
> The other approach is we could of course implement a special ioctl which
> effectively does a memcpy into the (created empty and sealed) memfd and
> does the necessary dance with the RMM to measure the contents. This
> would match the "transcript of the series of operations" described above
> - but seems much less ideal from the viewpoint of the VMM.

A VMM that supports Other Vendors will need to understand this sort of 
model regardless.

I don't particularly mind the idea of having the kernel consume a normal 
memfd and spit out a new object, but I find the concept of changing the 
type of the object in place, even if it has other references, and trying 
to control all the resulting races to be somewhat alarming.

In pseudo-Rust, this is the difference between:

fn convert_to_private(in: &mut Memfd)

and

fn convert_to_private(in: Memfd) -> PrivateMemoryFd

This doesn't map particularly nicely to the kernel, though.

--Andy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
  2022-03-04 19:24             ` Andy Lutomirski
@ 2022-03-07 13:26               ` Chao Peng
  2022-03-08 12:17                 ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: Chao Peng @ 2022-03-07 13:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Price, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, qemu-devel, Linux API, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Kirill A. Shutemov, Nakajima, Jun,
	Dave Hansen, Andi Kleen, David Hildenbrand

On Fri, Mar 04, 2022 at 11:24:30AM -0800, Andy Lutomirski wrote:
> On 2/23/22 04:05, Steven Price wrote:
> > On 23/02/2022 11:49, Chao Peng wrote:
> > > On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:
> > > > On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
> > > > > On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
> > > > > > On 1/18/22 05:21, Chao Peng wrote:
> > > > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > > > > 
> > > > > > > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> > > > > > > the file is inaccessible from userspace through ordinary MMU access
> > > > > > > (e.g., read/write/mmap). However, the file content can be accessed
> > > > > > > via a different mechanism (e.g. KVM MMU) indirectly.
> > > > > > > 
> > > > > > > It provides semantics required for KVM guest private memory support
> > > > > > > that a file descriptor with this seal set is going to be used as the
> > > > > > > source of guest memory in confidential computing environments such
> > > > > > > as Intel TDX/AMD SEV but may not be accessible from host userspace.
> > > > > > > 
> > > > > > > At this time only shmem implements this seal.
> > > > > > > 
> > > > > > 
> > > > > > I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
> > > > > > essentially transmutes a memfd into a different type of object.  While this
> > > > > > can apparently be done successfully and without races (as in this code),
> > > > > > it's at least awkward.  I think that either creating a special inaccessible
> > > > > > memfd should be a single operation that creates the correct type of object, or
> > > > > > there should be a clear justification for why it's a two-step process.
> > > > > 
> > > > > Now one justification may come from Steven's comment on patch 00: for ARM
> > > > > usage one could create a normal memfd, (partially) populate it with the
> > > > > initial guest memory content (e.g. firmware), and then F_SEAL_INACCESSIBLE
> > > > > it just before the first launch of the guest in KVM (the current code
> > > > > would definitely need changes to support that).
> > > > 
> > > > Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right?  So this won't work.
> > > 
> > > Hmm, right. If we set F_SEAL_INACCESSIBLE on a non-empty file, we would
> > > need to make sure that access to any existing mmap-ed areas is prevented,
> > > but that is hard.
> > > 
> > > > 
> > > > In any case, the whole confidential VM initialization story is a bit muddy.  From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it.  From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM.  These are fundamentally not the same thing even if they accomplish the same end goal.  For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.
> > > 
> > > Yes, TDX requires an ioctl. Steven may comment on the ARM part.
> > 
> > The Arm story is evolving so I can't give a definite answer yet. Our
> > current prototyping works by creating the initial VM content in a
> > memslot as with a normal VM and then calling an ioctl which throws the
> > big switch and converts all the (populated) pages to be protected. At
> > this point the RMM performs a measurement of the data that the VM is
> > being populated with.
> > 
> > The above (in our prototype) suffers from all the expected problems with
> > a malicious VMM being able to trick the host kernel into accessing those
> > pages after they have been protected (causing a fault detected by the
> > hardware).
> > 
> > The ideal (from our perspective) approach would be to follow the same
> > flow but where the VMM populates a memfd rather than normal anonymous
> > pages. The memfd could then be sealed and the pages converted to
> > protected ones (with the RMM measuring them in the process).
> > 
> > The question becomes how is that memfd populated? It would be nice if
> > that could be done using normal operations on a memfd (i.e. using
> > mmap()) and therefore this code could be (relatively) portable. This
> > would mean that any pages mapped from the memfd would either need to
> > block the sealing or be revoked at the time of sealing.
> > 
> > The other approach is we could of course implement a special ioctl which
> > effectively does a memcpy into the (created empty and sealed) memfd and
> > does the necessary dance with the RMM to measure the contents. This
> > would match the "transcript of the series of operations" described above
> > - but seems much less ideal from the viewpoint of the VMM.
> 
> A VMM that supports Other Vendors will need to understand this sort of model
> regardless.
> 
> I don't particularly mind the idea of having the kernel consume a normal
> memfd and spit out a new object, but I find the concept of changing the type
> of the object in place, even if it has other references, and trying to
> control all the resulting races to be somewhat alarming.
> 
> In pseudo-Rust, this is the difference between:
> 
> fn convert_to_private(in: &mut Memfd)
> 
> and
> 
> fn convert_to_private(in: Memfd) -> PrivateMemoryFd
> 
> This doesn't map particularly nicely to the kernel, though.

I understand the Rust semantics and the difficulty of handling the races.
Perhaps we should not expose F_SEAL_INACCESSIBLE to userspace at all;
instead we can use a new in-kernel flag to indicate the same thing. That
flag would be set only when the memfd is created with MFD_INACCESSIBLE.

Chao
> 
> --Andy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
  2022-03-07 13:26               ` Chao Peng
@ 2022-03-08 12:17                 ` Paolo Bonzini
  0 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2022-03-08 12:17 UTC (permalink / raw)
  To: Chao Peng, Andy Lutomirski
  Cc: Steven Price, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, qemu-devel, Linux API, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Yu Zhang,
	Kirill A. Shutemov, Nakajima, Jun, Dave Hansen, Andi Kleen,
	David Hildenbrand

On 3/7/22 14:26, Chao Peng wrote:
>> In pseudo-Rust, this is the difference between:
>>
>> fn convert_to_private(in: &mut Memfd)
>>
>> and
>>
>> fn convert_to_private(in: Memfd) -> PrivateMemoryFd
>>
>> This doesn't map particularly nicely to the kernel, though.
> I understand the Rust semantics and the difficulty of handling the races.
> Perhaps we should not expose F_SEAL_INACCESSIBLE to userspace at all;
> instead we can use a new in-kernel flag to indicate the same thing. That
> flag would be set only when the memfd is created with MFD_INACCESSIBLE.

Yes, I like this.

Paolo


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-03-08 12:17 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20220118132121.31388-1-chao.p.peng@linux.intel.com>
2022-02-08 18:33 ` [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private memory Mike Rapoport
2022-02-17 13:47   ` Chao Peng
     [not found] ` <20220118132121.31388-2-chao.p.peng@linux.intel.com>
2022-02-11 23:33   ` [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE Andy Lutomirski
2022-02-17 13:06     ` Chao Peng
2022-02-17 19:09       ` Andy Lutomirski
2022-02-23 11:49         ` Chao Peng
2022-02-23 12:05           ` Steven Price
2022-03-04 19:24             ` Andy Lutomirski
2022-03-07 13:26               ` Chao Peng
2022-03-08 12:17                 ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).