Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

From: "Nikunj A. Dadhania" <nikunj@amd.com>
To: "Gupta, Pankaj" <pankaj.gupta@amd.com>,
	Chao Peng <chao.p.peng@linux.intel.com>,
	Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>,
	Hugh Dickins <hughd@google.com>, Jeff Layton <jlayton@kernel.org>,
	"J . Bruce Fields" <bfields@fieldses.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Shuah Khan <shuah@kernel.org>, Mike Rapoport <rppt@kernel.org>,
	Steven Price <steven.price@arm.com>,
	"Maciej S . Szmigiero" <mail@maciej.szmigiero.name>,
	Vlastimil Babka <vbabka@suse.cz>,
	Vishal Annapurve <vannapurve@google.com>,
	Yu Zhang <yu.c.zhang@linux.intel.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com,
	ak@linux.intel.com, david@redhat.com, aarcange@redhat.com,
	ddutile@redhat.com, dhildenb@redhat.com,
	Quentin Perret <qperret@google.com>,
	Michael Roth <michael.roth@amd.com>,
	mhocko@suse.com, Muchun Song <songmuchun@bytedance.com>,
	bharata@amd.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-kselftest@vger.kernel.org, linux-api@vger.kernel.org,
	linux-doc@vger.kernel.org, qemu-devel@nongnu.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
Date: Thu, 11 Aug 2022 22:48:28 +0530	[thread overview]
Message-ID: <9dc91ce8-4cb6-37e6-4c25-27a72dc11dd0@amd.com> (raw)
In-Reply-To: <9e86daea-5619-a216-fe02-0562cf14c501@amd.com>

On 11/08/22 17:00, Gupta, Pankaj wrote:
> 
>>> This is the v7 of this series which tries to implement the fd-based KVM
>>> guest private memory. The patches are based on latest kvm/queue branch
>>> commit:
>>>
>>>    b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>> split_desc_cache only by default capacity
>>>
>>> Introduction
>>> ------------
>>> In general this patch series introduce fd-based memslot which provides
>>> guest memory through memory file descriptor fd[offset,size] instead of
>>> hva/size. The fd can be created from a supported memory filesystem
>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>> and the the memory backing store exchange callbacks when such memslot
>>> gets created. At runtime KVM will call into callbacks provided by the
>>> backing store to get the pfn with the fd+offset. Memory backing store
>>> will also call into KVM callbacks when userspace punch hole on the fd
>>> to notify KVM to unmap secondary MMU page table entries.
>>>
>>> Comparing to existing hva-based memslot, this new type of memslot allows
>>> guest memory unmapped from host userspace like QEMU and even the kernel
>>> itself, therefore reduce attack surface and prevent bugs.
>>>
>>> Based on this fd-based memslot, we can build guest private memory that
>>> is going to be used in confidential computing environments such as Intel
>>> TDX and AMD SEV. When supported, the memory backing store can provide
>>> more enforcement on the fd and KVM can use a single memslot to hold both
>>> the private and shared part of the guest memory.
>>>
>>> mm extension
>>> ---------------------
>>> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
>>> created with these flags cannot read(), write() or mmap() etc via normal
>>> MMU operations. The file content can only be used with the newly
>>> introduced memfile_notifier extension.
>>>
>>> The memfile_notifier extension provides two sets of callbacks for KVM to
>>> interact with the memory backing store:
>>>    - memfile_notifier_ops: callbacks for memory backing store to notify
>>>      KVM when memory gets invalidated.
>>>    - backing store callbacks: callbacks for KVM to call into memory
>>>      backing store to request memory pages for guest private memory.
>>>
>>> The memfile_notifier extension also provides APIs for memory backing
>>> store to register/unregister itself and to trigger the notifier when the
>>> bookmarked memory gets invalidated.
>>>
>>> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
>>> prevent double allocation caused by unintentional guest when we only
>>> have a single side of the shared/private memfds effective.
>>>
>>> memslot extension
>>> -----------------
>>> Add the private fd and the fd offset to existing 'shared' memslot so
>>> that both private/shared guest memory can live in one single memslot.
>>> A page in the memslot is either private or shared. Whether a guest page
>>> is private or shared is maintained through reusing existing SEV ioctls
>>> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>>
>>> Test
>>> ----
>>> To test the new functionalities of this patch TDX patchset is needed.
>>> Since TDX patchset has not been merged so I did two kinds of test:
>>>
>>> -  Regresion test on kvm/queue (this patchset)
>>>     Most new code are not covered. Code also in below repo:
>>>     https://github.com/chao-p/linux/tree/privmem-v7
>>>
>>> -  New Funational test on latest TDX code
>>>     The patch is rebased to latest TDX code and tested the new
>>>     funcationalities. See below repos:
>>>     Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>>     QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
>>
>> While debugging an issue with SEV+UPM, found that fallocate() returns
>> an error in QEMU which is not handled (EINTR). With the below handling
>> of EINTR subsequent fallocate() succeeds:
>>
>>
>> diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
>> index af8fb0c957..e8597ed28d 100644
>> --- a/backends/hostmem-memfd-private.c
>> +++ b/backends/hostmem-memfd-private.c
>> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>       MachineState *machine = MACHINE(qdev_get_machine());
>>       uint32_t ram_flags;
>>       char *name;
>> -    int fd, priv_fd;
>> +    int fd, priv_fd, ret;
>>         if (!backend->size) {
>>           error_setg(errp, "can't create backend with size 0");
>> @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>                                      backend->size, ram_flags, fd, 0, errp);
>>       g_free(name);
>>   -    fallocate(priv_fd, 0, 0, backend->size);
>> +again:
>> +    ret = fallocate(priv_fd, 0, 0, backend->size);
>> +    if (ret) {
>> +           perror("Fallocate failed: \n");
>> +           if (errno == EINTR)
>> +                   goto again;
>> +           else
>> +                   exit(1);
>> +    }
>>
>> However, fallocate() preallocates full guest memory before starting the guest.
>> With this behaviour guest memory is *not* demand pinned. Is there a way to
>> prevent fallocate() from reserving full guest memory?
> 
> Isn't the pinning being handled by the corresponding host memory backend with mmu > notifier and architecture support while doing the memory operations e.g page> migration and swapping/reclaim (not supported currently AFAIU). But yes, we need> to allocate entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.

That is correct, but the question is when does the memory allocated, as these flags are set,
memory is neither moved nor reclaimed. In current scenario, if I start a 32GB guest, all 32GB is
allocated.

Regards
Nikunj