From: "Nikunj A. Dadhania" <nikunj@amd.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
Marc Zyngier <maz@kernel.org>,
Oliver Upton <oliver.upton@linux.dev>,
Huacai Chen <chenhuacai@kernel.org>,
Michael Ellerman <mpe@ellerman.id.au>,
Anup Patel <anup@brainfault.org>,
Paul Walmsley <paul.walmsley@sifive.com>,
Palmer Dabbelt <palmer@dabbelt.com>,
Albert Ou <aou@eecs.berkeley.edu>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
Paul Moore <paul@paul-moore.com>,
James Morris <jmorris@namei.org>,
"Serge E. Hallyn" <serge@hallyn.com>,
kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
kvmarm@lists.linux.dev, linux-mips@vger.kernel.org,
linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org,
linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-security-module@vger.kernel.org,
linux-kernel@vger.kernel.org,
Chao Peng <chao.p.peng@linux.intel.com>,
Fuad Tabba <tabba@google.com>,
Jarkko Sakkinen <jarkko@kernel.org>,
Yu Zhang <yu.c.zhang@linux.intel.com>,
Vishal Annapurve <vannapurve@google.com>,
Ackerley Tng <ackerleytng@google.com>,
Maciej Szmigiero <mail@maciej.szmigiero.name>,
Vlastimil Babka <vbabka@suse.cz>,
David Hildenbrand <david@redhat.com>,
Quentin Perret <qperret@google.com>,
Michael Roth <michael.roth@amd.com>, Wang <wei.w.wang@intel.com>,
Liam Merwick <liam.merwick@oracle.com>,
Isaku Yamahata <isaku.yamahata@gmail.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes
Date: Wed, 26 Jul 2023 16:50:16 +0530
Message-ID: <2f98a32c-bd3d-4890-b757-4d2f67a3b1a7@amd.com>
In-Reply-To: <ZL6uMk/8UeuGj8CP@google.com>
Hi Sean,
On 7/24/2023 10:30 PM, Sean Christopherson wrote:
> On Mon, Jul 24, 2023, Nikunj A. Dadhania wrote:
>> On 7/19/2023 5:14 AM, Sean Christopherson wrote:
>>> This is the next iteration of implementing fd-based (instead of vma-based)
>>> memory for KVM guests. If you want the full background of why we are doing
>>> this, please go read the v10 cover letter[1].
>>>
>>> The biggest change from v10 is to implement the backing storage in KVM
>>> itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
>>> See link[2] for details on why we pivoted to a KVM-specific approach.
>>>
>>> Key word is "biggest". Relative to v10, there are many big changes.
>>> Highlights below (I can't remember everything that got changed at
>>> this point).
>>>
>>> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
>>> documentation. And ideally, we'll have even more tests before merging.
>>> There are also several gaps/opens (to be discussed in tomorrow's PUCK).
>>
>> As per our discussion on the PUCK call, here are the memory/NUMA accounting
>> related observations that I had while working on SNP guest secure page migration:
>>
>> * gmem allocations are currently treated as file page allocations
>> accounted to the kernel and not to the QEMU process.
>
> We need to level set on terminology: these are all *stats*, not accounting. That
> distinction matters because we have wiggle room on stats, e.g. we can probably get
> away with just about any definition of how guest_memfd memory impacts stats, so
> long as the information that is surfaced to userspace is useful and expected.
>
> But we absolutely need to get accounting correct, specifically the allocations
> need to be correctly accounted in memcg. And unless I'm missing something,
> nothing in here shows anything related to memcg.
I tried out memcg after creating a separate cgroup for the QEMU process. Guest
memory is accounted in memcg:
$ egrep -w "file|file_thp|unevictable" memory.stat
file 42978775040
file_thp 42949672960
unevictable 42953588736
NUMA allocations come from the right nodes, as set by numactl:
$ egrep -w "file|file_thp|unevictable" memory.numa_stat
file N0=0 N1=20480 N2=21489377280 N3=21489377280
file_thp N0=0 N1=0 N2=21472739328 N3=21476933632
unevictable N0=0 N1=0 N2=21474697216 N3=21478891520
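As a cross-check on the numbers above, the per-node values in memory.numa_stat should add up to the totals in memory.stat. Below is a minimal Python sketch of that check, assuming the cgroup v2 text format shown above. The helper names parse_stat and parse_numa_stat are made up for illustration, and the input strings are pasted from the output above rather than read from a live cgroup.

```python
# Values copied from the memory.stat / memory.numa_stat output above.
MEMORY_STAT = """\
file 42978775040
file_thp 42949672960
unevictable 42953588736
"""

MEMORY_NUMA_STAT = """\
file N0=0 N1=20480 N2=21489377280 N3=21489377280
file_thp N0=0 N1=0 N2=21472739328 N3=21476933632
unevictable N0=0 N1=0 N2=21474697216 N3=21478891520
"""


def parse_stat(text):
    """Map each counter name to its value in bytes."""
    stats = {}
    for line in text.splitlines():
        name, value = line.split()
        stats[name] = int(value)
    return stats


def parse_numa_stat(text):
    """Map each counter name to a {node: bytes} dict."""
    stats = {}
    for line in text.splitlines():
        name, *fields = line.split()
        stats[name] = {}
        for field in fields:
            node, value = field.split("=")
            stats[name][node] = int(value)
    return stats


totals = parse_stat(MEMORY_STAT)
per_node = parse_numa_stat(MEMORY_NUMA_STAT)

for name, total in totals.items():
    node_sum = sum(per_node[name].values())
    status = "match" if node_sum == total else "MISMATCH"
    print(f"{name}: {total / 2**30:.2f} GiB, per-node sum {status}")
```

For the values above, all three counters add up, and nearly all of the memory sits on N2 and N3, as requested with numactl.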
>
>> Starting an SNP guest with 40G memory with memory interleave between
>> Node2 and Node3
>>
>> $ numactl -i 2,3 ./bootg_snp.sh
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 242179 root 20 0 40.4g 99580 51676 S 78.0 0.0 0:56.58 qemu-system-x86
>>
>> -> Incorrect process resident memory and shared memory is reported
>
> I don't know that I would call these "incorrect". Shared memory definitely is
> correct, because by definition guest_memfd isn't shared. RSS is less clear cut;
> gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
> scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
> assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
> memslots).
I am not sure why RSS would exceed VIRT; it should be at most 40G (assuming all the
memory is private).
In my experiments with the hack below, MM_FILEPAGES does get accounted to RSS/SHR in top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4339 root 20 0 40.4g 40.1g 40.1g S 76.7 16.0 0:13.83 qemu-system-x86
diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..5b1f48a2e714 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -166,6 +166,7 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member)
 {
 	trace_rss_stat(mm, member);
 }
+EXPORT_SYMBOL(mm_trace_rss_stat);
 
 /*
  * Note: this doesn't free the actual pages themselves. That
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index a7e926af4255..e4f268bf9ce2 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -91,6 +91,10 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
 			clear_highpage(folio_page(folio, i));
 	}
 
+	/* Account only once for the first time */
+	if (!folio_test_dirty(folio))
+		add_mm_counter(current->mm, MM_FILEPAGES, folio_nr_pages(folio));
+
 	folio_mark_accessed(folio);
 	folio_mark_dirty(folio);
 	folio_mark_uptodate(folio);
We can update the rss_stat appropriately to get correct reporting in userspace.
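To see from userspace which counter such a change lands in, the split RSS fields in /proc/<pid>/status can be read directly: MM_FILEPAGES is reported there as RssFile, and, as far as I know, top derives SHR from the file plus shmem counters, which would explain SHR moving together with RES above. Here is a minimal Python sketch, assuming a kernel that exposes RssAnon/RssFile/RssShmem. The helper names are made up for illustration, and the sample text holds invented placeholder values, not measurements from this setup.

```python
def parse_rss(status_text):
    """Extract the RSS-related fields (in kB) from /proc/<pid>/status text."""
    wanted = ("VmRSS", "RssAnon", "RssFile", "RssShmem")
    result = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            result[key] = int(rest.split()[0])  # value looks like "<n> kB"
    return result


def read_rss(pid):
    """Read and parse /proc/<pid>/status for a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        return parse_rss(status.read())


# Placeholder input in the /proc/<pid>/status layout; the numbers are invented.
SAMPLE_STATUS = """\
Name:\tqemu-system-x86
VmRSS:\t    1000 kB
RssAnon:\t     300 kB
RssFile:\t     600 kB
RssShmem:\t     100 kB
"""

rss = parse_rss(SAMPLE_STATUS)
for key, kb in rss.items():
    print(f"{key}: {kb} kB")
```

On a live system, read_rss(<qemu pid>) before and after applying the hack should show the difference in RssFile, if the assumption about the field mapping holds.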
>> Accounting of the memory happens in the host page fault handler path,
>> but for private guest pages we will never hit that.
>>
>> * NUMA allocation does use the process mempolicy for appropriate node
>> allocation (Node2 and Node3), but they again do not get attributed to
>> the QEMU process
>>
>> Every 1.0s: sudo numastat -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage" gomati: Mon Jul 24 11:51:34 2023
>>
>> Per-node process memory usage (in MBs)
>> PID Node 0 Node 1 Node 2 Node 3 Total
>> 242179 (qemu-system-x86) 21.14 1.61 39.44 39.38 101.57
>>
>> Per-node system memory usage (in MBs):
>> Node 0 Node 1 Node 2 Node 3 Total
>> FilePages 2475.63 2395.83 23999.46 23373.22 52244.14
>>
>>
>> * Most of the memory accounting relies on the VMAs and as private-fd of
>> gmem doesn't have a VMA (and that was the design goal), user-space fails
>> to attribute the memory appropriately to the process.
>>
>> /proc/<qemu pid>/numa_maps
>> 7f528be00000 interleave:2-3 file=/memfd:memory-backend-memfd-shared\040(deleted) anon=1070 dirty=1070 mapped=1987 mapmax=256 active=1956 N2=582 N3=1405 kernelpagesize_kB=4
>> 7f5c90200000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted)
>> 7f5c90400000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=32 active=0 N2=32 kernelpagesize_kB=4
>> 7f5c90800000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=892 active=0 N2=512 N3=380 kernelpagesize_kB=4
>>
>> /proc/<qemu pid>/smaps
>> 7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
>> 7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
>> 7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
>> 7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025 /memfd:rom-backend-memfd-shared (deleted)
>
> This is all expected, and IMO correct. There are no userspace mappings, and so
> not accounting anything is working as intended.
That doesn't sound quite correct: if 10 SNP guests are running, each using 10GB, how
would we know who is using the 100GB of memory?
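Since memcg does charge the gmem pages (see the memory.stat output earlier in this mail), one stop-gap, assuming each guest runs in its own cgroup, is to compare the `file` counter across those cgroups. Below is a rough Python sketch of that idea. The helper names and the cgroup layout are assumptions, and `file` also includes ordinary page cache charged to the cgroup, so this is an upper bound rather than an exact gmem figure.

```python
from pathlib import Path


def file_bytes_by_cgroup(root):
    """Return {cgroup name: 'file' bytes} for each child cgroup of root."""
    usage = {}
    for stat_path in Path(root).glob("*/memory.stat"):
        for line in stat_path.read_text().splitlines():
            name, value = line.split()
            if name == "file":
                usage[stat_path.parent.name] = int(value)
    return usage


def report(root):
    """Print child cgroups sorted by 'file' usage, largest first."""
    ranked = sorted(file_bytes_by_cgroup(root).items(),
                    key=lambda item: item[1], reverse=True)
    for cgroup, nbytes in ranked:
        print(f"{cgroup}: {nbytes / 2**30:.1f} GiB")
```

Calling report() on the parent cgroup that holds the per-guest cgroups (the exact path depends on how the guests were launched) would list the guests by file-backed usage.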
>
>> * QEMU based NUMA bindings will not work. Memory backend uses mbind()
>> to set the policy for a particular virtual memory range but gmem
>> private-FD does not have a virtual memory range visible in the host.
>
> Yes, adding a generic fbind() is the way to solve this.
Regards,
Nikunj