From: Qi Zheng <zhengqi.arch@bytedance.com>
To: david@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, songmuchun@bytedance.com,
	zhouchengming@bytedance.com, akpm@linux-foundation.org,
	tglx@linutronix.de, kirill.shutemov@linux.intel.com,
	jgg@nvidia.com, tj@kernel.org, dennis@kernel.org,
	ming.lei@redhat.com
Subject: Re: [RFC PATCH 00/18] Try to free user PTE page table pages
Date: Tue, 17 May 2022 16:30:35 +0800
Message-ID: <8c51d9ae-5a8e-74a9-ddc2-70b5fcd38427@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>



On 2022/4/29 9:35 PM, Qi Zheng wrote:
> Hi,
> 
> This patch series aims to free user PTE page table pages when no one is
> using them.
> 
> The story begins with some malloc libraries (e.g. jemalloc or tcmalloc) that
> usually reserve large ranges of VAs with mmap() and never unmap them. They
> use madvise(MADV_DONTNEED) to free physical memory when they want to. But
> madvise() does not free the page tables, so a process that touches an
> enormous virtual address space can accumulate many page tables.
> 
> The following figures are a memory usage snapshot of one process, taken from
> one of our servers:
> 
>          VIRT:  55t
>          RES:   590g
>          VmPTE: 110g
> 
> As we can see, the PTE page tables take up 110g, while RES is 590g. In
> theory, the process only needs 1.2g of PTE page tables to map that physical
> memory. The PTE page tables occupy so much memory because
> madvise(MADV_DONTNEED) only clears the PTEs and frees the physical memory; it
> does not free the PTE page table pages. So we can free those empty PTE page
> tables to save memory. In the above case, we can save about 108g (best
> case). And the larger the difference between VIRT and RES, the more memory
> we save.
> 
> In this patch series, we add a pte_ref field to the struct page of each page
> table page to track how many users a user PTE page table has. Similar to the
> page refcount mechanism, a user of a PTE page table must hold a reference to
> it before accessing it. The user PTE page table page may be freed when the
> last reference is dropped.
> 
> Unlike my earlier patchset[1], pte_ref is now a struct percpu_ref, and we
> switch it to atomic mode only in cases such as MADV_DONTNEED and MADV_FREE
> that may clear user PTE page table entries, and then release the user PTE
> page table page once pte_ref is found to be 0. The advantage is that there is
> basically no performance overhead in percpu mode, yet empty PTE pages can
> still be freed. In addition, the implementation in this patchset is much
> simpler and more portable than that of the earlier patchset[1].
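
As a rough check of the 1.2g and 108g figures quoted above, the arithmetic
can be sketched as follows. The page size, the PTE size and the helper name
min_pte_table_bytes are assumptions made for illustration (x86_64 with 4 KiB
base pages and 8-byte PTEs); the mail does not state them.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumptions (not stated in the mail): 4 KiB base pages and 8-byte PTEs,
# hence 512 PTEs per 4 KiB PTE page table page.
GIB = 1 << 30
PAGE_SIZE = 4096
PTE_SIZE = 8
PTES_PER_TABLE = PAGE_SIZE // PTE_SIZE  # 512


def min_pte_table_bytes(resident_bytes):
    """Bytes of PTE page tables needed if every table is fully populated."""
    pages = resident_bytes // PAGE_SIZE
    tables = -(-pages // PTES_PER_TABLE)  # ceiling division
    return tables * PAGE_SIZE


res = 590 * GIB      # RES from the snapshot
vm_pte = 110 * GIB   # VmPTE from the snapshot
needed = min_pte_table_bytes(res)

print(f"needed: {needed / GIB:.2f}g")
print(f"reclaimable (best case): {(vm_pte - needed) / GIB:.1f}g")
```

Under these assumptions the minimum comes out at roughly 1.15g, in line with
the 1.2g quoted, which leaves a bit under 109g reclaimable in the best case.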

Hi David,

I learned from the LWN article[1] that you led a session at LSFMM on
the problems posed by the lack of page-table reclaim (and thank you very
much for mentioning some of my work in this direction). So I would like
to know: what are the community's further plans for this problem?

For the approach of adding a pte_ref to each PTE page table page, I have
so far posted two versions: the atomic count version[2] and the
percpu_ref version (this patchset).

For the atomic count version:
- Advantage: PTE pages can be freed as soon as the reference count drops
              to 0.
- Disadvantage: Increments and decrements of pte_ref are atomic
                 operations, which have a certain performance overhead,
                 but should not become a performance bottleneck until the
                 mmap_lock contention problem is resolved.

For the percpu_ref version:
- Advantage: In percpu mode, increments and decrements of pte_ref all
              operate on CPU-local variables, so there is basically no
              performance overhead.
- Disadvantage: pte_ref must be explicitly switched to atomic mode
                 before unused PTE pages can be freed.
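
To make the trade-off between the two versions concrete, here is a toy
userspace model. It is only a sketch, not kernel code: the class and method
names (AtomicPteRef, PercpuPteRef, get, put, switch_to_atomic) are
hypothetical, plain integers stand in for atomic and per-CPU operations, and
details of the real percpu_ref (such as RCU) are left out.

```python
# Toy model of the two pte_ref schemes. All names here are hypothetical.

class AtomicPteRef:
    """Atomic-count version: one shared counter, freed as soon as it hits 0."""

    def __init__(self):
        self.count = 0
        self.freed = False

    def get(self, cpu):
        self.count += 1        # stands in for an atomic increment

    def put(self, cpu):
        self.count -= 1        # stands in for an atomic decrement
        if self.count == 0:
            self.freed = True  # the PTE page can be freed right away


class PercpuPteRef:
    """percpu_ref version: per-CPU counters until switched to atomic mode."""

    def __init__(self, nr_cpus):
        self.percpu = [0] * nr_cpus
        self.atomic_mode = False
        self.count = 0
        self.freed = False

    def get(self, cpu):
        if self.atomic_mode:
            self.count += 1
        else:
            self.percpu[cpu] += 1  # CPU-local, no shared counter touched

    def put(self, cpu):
        if self.atomic_mode:
            self.count -= 1
            if self.count == 0:
                self.freed = True
        else:
            self.percpu[cpu] -= 1  # reaching zero is not detected here

    def switch_to_atomic(self):
        """Fold the per-CPU counts into one counter (e.g. on MADV_DONTNEED)."""
        self.count += sum(self.percpu)
        self.percpu = [0] * len(self.percpu)
        self.atomic_mode = True
        if self.count == 0:
            self.freed = True


a = AtomicPteRef()
a.get(0)
a.put(1)
print(a.freed)  # True: freed on the final put

p = PercpuPteRef(nr_cpus=2)
p.get(0)
p.put(1)
print(p.freed)  # False: still in percpu mode
p.switch_to_atomic()
print(p.freed)  # True: the sum over CPUs is 0 after the switch
```

In the model, a get on one CPU and a put on another leave the per-CPU
counters at +1 and -1, so only the sum taken at switch time can show that
the page is unused. That is why the percpu_ref version needs the explicit
switch.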

There is still much to optimize in the implementation of both versions.
But before I do further work, I would like to hear your and the
community's views and suggestions on them.

Thanks,
Qi

[1]: https://lwn.net/Articles/893726 (Ways to reclaim unused page-table pages)
[2]: https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/


Thread overview: 27+ messages
2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 01/18] x86/mm/encrypt: add the missing pte_unmap() call Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 02/18] percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync() returns Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 03/18] percpu_ref: make percpu_ref_switch_lock per percpu_ref Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 04/18] mm: convert to use ptep_clear() in pte_clear_not_present_full() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 05/18] mm: split the related definitions of pte_offset_map_lock() into pgtable.h Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 06/18] mm: introduce CONFIG_FREE_USER_PTE Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 07/18] mm: add pte_to_page() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 08/18] mm: introduce percpu_ref for user PTE page table page Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 09/18] pte_ref: add pte_tryget() and {__,}pte_put() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 10/18] mm: add pte_tryget_map{_lock}() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 11/18] mm: convert to use pte_tryget_map_lock() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 12/18] mm: convert to use pte_tryget_map() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 13/18] mm: add try_to_free_user_pte() helper Qi Zheng
2022-04-30 13:35   ` Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 14/18] mm: use try_to_free_user_pte() in MADV_DONTNEED case Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 15/18] mm: use try_to_free_user_pte() in MADV_FREE case Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 16/18] pte_ref: add track_pte_{set, clear}() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 17/18] x86/mm: add x86_64 support for pte_ref Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 18/18] Documentation: add document " Qi Zheng
2022-04-30 13:19   ` Bagas Sanjaya
2022-04-30 13:32     ` Qi Zheng
2022-05-17  8:30 ` Qi Zheng [this message]
2022-05-18 14:51   ` [RFC PATCH 00/18] Try to free user PTE page table pages David Hildenbrand
2022-05-18 14:56     ` Matthew Wilcox
2022-05-19  4:03       ` Qi Zheng
2022-05-19  3:58     ` Qi Zheng
