From: Qi Zheng <zhengqi.arch@bytedance.com>
To: david@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, songmuchun@bytedance.com,
zhouchengming@bytedance.com, akpm@linux-foundation.org,
tglx@linutronix.de, kirill.shutemov@linux.intel.com,
jgg@nvidia.com, tj@kernel.org, dennis@kernel.org,
ming.lei@redhat.com
Subject: Re: [RFC PATCH 00/18] Try to free user PTE page table pages
Date: Tue, 17 May 2022 16:30:35 +0800 [thread overview]
Message-ID: <8c51d9ae-5a8e-74a9-ddc2-70b5fcd38427@bytedance.com> (raw)
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
On 2022/4/29 9:35 PM, Qi Zheng wrote:
> Hi,
>
> This patch series aims to try to free user PTE page table pages when no one is
> using it.
>
> The beginning of this story is that some malloc libraries(e.g. jemalloc or
> tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those
> VAs. They will use madvise(MADV_DONTNEED) to free physical memory if they want.
> But the page tables do not be freed by madvise(), so it can produce many
> page tables when the process touches an enormous virtual address space.
>
> The following figures are a memory usage snapshot of one process which actually
> happened on our server:
>
> VIRT: 55t
> RES: 590g
> VmPTE: 110g
>
> As we can see, the PTE page tables size is 110g, while the RES is 590g. In
> theory, the process only need 1.2g PTE page tables to map those physical
> memory. The reason why PTE page tables occupy a lot of memory is that
> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
> doesn't free the PTE page table pages. So we can free those empty PTE page
> tables to save memory. In the above cases, we can save memory about 108g(best
> case). And the larger the difference between the size of VIRT and RES, the
> more memory we save.
>
> In this patch series, we add a pte_ref field to the struct page of page table
> to track how many users of user PTE page table. Similar to the mechanism of page
> refcount, the user of PTE page table should hold a refcount to it before
> accessing. The user PTE page table page may be freed when the last refcount is
> dropped.
>
> Different from the idea of another patchset of mine before[1], the pte_ref
> becomes a struct percpu_ref type, and we switch it to atomic mode only in cases
> such as MADV_DONTNEED and MADV_FREE that may clear the user PTE page table
> entryies, and then release the user PTE page table page when checking that
> pte_ref is 0. The advantage of this is that there is basically no performance
> overhead in percpu mode, but it can also free the empty PTEs. In addition, the
> code implementation of this patchset is much simpler and more portable than the
> another patchset[1].
Hi David,
I learned from the LWN article[1] that you led a session at the LSFMM on
the problems posed by the lack of page-table reclaim (And thank you very
much for mentioning some of my work in this direction). So I want to
know, what are the further plans of the community for this problem?
For the way of adding pte_ref to each PTE page table page, I currently
posted two versions: atomic count version[2] and percpu_ref version(This
patchset).
For the atomic count version:
- Advantage: PTE pages can be freed as soon as the reference count drops
to 0.
- Disadvantage: The addition and subtraction of pte_ref are atomic
operations, which have a certain performance overhead,
but should not become a performance bottleneck until the
mmap_lock contention problem is resolved.
For the percpu_ref version:
- Advantage: In the percpu mode, the addition and subtraction of the
pte_ref are all operations on local cpu variables, there
is basically no performance overhead.
Disadvantage: Need to explicitly convert the pte_ref to atomic mode so
that the unused PTE pages can be freed.
There are still many places to optimize the code implementation of these
two versions. But before I do further work, I would like to hear your
and the community's views and suggestions on these two versions.
Thanks,
Qi
[1]: https://lwn.net/Articles/893726 (Ways to reclaim unused page-table
pages)
[2]:
https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
>
--
Thanks,
Qi
next prev parent reply other threads:[~2022-05-17 8:31 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 01/18] x86/mm/encrypt: add the missing pte_unmap() call Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 02/18] percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync() returns Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 03/18] percpu_ref: make percpu_ref_switch_lock per percpu_ref Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 04/18] mm: convert to use ptep_clear() in pte_clear_not_present_full() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 05/18] mm: split the related definitions of pte_offset_map_lock() into pgtable.h Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 06/18] mm: introduce CONFIG_FREE_USER_PTE Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 07/18] mm: add pte_to_page() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 08/18] mm: introduce percpu_ref for user PTE page table page Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 09/18] pte_ref: add pte_tryget() and {__,}pte_put() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 10/18] mm: add pte_tryget_map{_lock}() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 11/18] mm: convert to use pte_tryget_map_lock() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 12/18] mm: convert to use pte_tryget_map() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 13/18] mm: add try_to_free_user_pte() helper Qi Zheng
2022-04-30 13:35 ` Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 14/18] mm: use try_to_free_user_pte() in MADV_DONTNEED case Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 15/18] mm: use try_to_free_user_pte() in MADV_FREE case Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 16/18] pte_ref: add track_pte_{set, clear}() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 17/18] x86/mm: add x86_64 support for pte_ref Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 18/18] Documentation: add document " Qi Zheng
2022-04-30 13:19 ` Bagas Sanjaya
2022-04-30 13:32 ` Qi Zheng
2022-05-17 8:30 ` Qi Zheng [this message]
2022-05-18 14:51 ` [RFC PATCH 00/18] Try to free user PTE page table pages David Hildenbrand
2022-05-18 14:56 ` Matthew Wilcox
2022-05-19 4:03 ` Qi Zheng
2022-05-19 3:58 ` Qi Zheng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8c51d9ae-5a8e-74a9-ddc2-70b5fcd38427@bytedance.com \
--to=zhengqi.arch@bytedance.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=dennis@kernel.org \
--cc=jgg@nvidia.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ming.lei@redhat.com \
--cc=songmuchun@bytedance.com \
--cc=tglx@linutronix.de \
--cc=tj@kernel.org \
--cc=zhouchengming@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).