linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, songmuchun@bytedance.com,
	zhouchengming@bytedance.com, akpm@linux-foundation.org,
	tglx@linutronix.de, kirill.shutemov@linux.intel.com,
	jgg@nvidia.com, tj@kernel.org, dennis@kernel.org,
	ming.lei@redhat.com
Subject: Re: [RFC PATCH 00/18] Try to free user PTE page table pages
Date: Wed, 18 May 2022 16:51:06 +0200	[thread overview]
Message-ID: <37055be1-05af-f7ef-c33e-27f90fa0f9ca@redhat.com> (raw)
In-Reply-To: <8c51d9ae-5a8e-74a9-ddc2-70b5fcd38427@bytedance.com>

On 17.05.22 10:30, Qi Zheng wrote:
> 
> 
> On 2022/4/29 9:35 PM, Qi Zheng wrote:
>> Hi,
>>
>> This patch series aims to try to free user PTE page table pages when no one is
>> using it.
>>
>> The beginning of this story is that some malloc libraries(e.g. jemalloc or
>> tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those
>> VAs. They will use madvise(MADV_DONTNEED) to free physical memory if they want.
>> But the page tables do not be freed by madvise(), so it can produce many
>> page tables when the process touches an enormous virtual address space.
>>
>> The following figures are a memory usage snapshot of one process which actually
>> happened on our server:
>>
>>          VIRT:  55t
>>          RES:   590g
>>          VmPTE: 110g
>>
>> As we can see, the PTE page tables size is 110g, while the RES is 590g. In
>> theory, the process only need 1.2g PTE page tables to map those physical
>> memory. The reason why PTE page tables occupy a lot of memory is that
>> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
>> doesn't free the PTE page table pages. So we can free those empty PTE page
>> tables to save memory. In the above cases, we can save memory about 108g(best
>> case). And the larger the difference between the size of VIRT and RES, the
>> more memory we save.
>>
>> In this patch series, we add a pte_ref field to the struct page of page table
>> to track how many users of user PTE page table. Similar to the mechanism of page
>> refcount, the user of PTE page table should hold a refcount to it before
>> accessing. The user PTE page table page may be freed when the last refcount is
>> dropped.
>>
>> Different from the idea of another patchset of mine before[1], the pte_ref
>> becomes a struct percpu_ref type, and we switch it to atomic mode only in cases
>> such as MADV_DONTNEED and MADV_FREE that may clear the user PTE page table
>> entryies, and then release the user PTE page table page when checking that
>> pte_ref is 0. The advantage of this is that there is basically no performance
>> overhead in percpu mode, but it can also free the empty PTEs. In addition, the
>> code implementation of this patchset is much simpler and more portable than the
>> another patchset[1].
> 
> Hi David,
> 
> I learned from the LWN article[1] that you led a session at the LSFMM on
> the problems posed by the lack of page-table reclaim (And thank you very
> much for mentioning some of my work in this direction). So I want to
> know, what are the further plans of the community for this problem?

Hi,

yes, I talked about the involved challenges, especially, how malicious
user space can trigger allocation of almost elusively page tables and
essentially consume a lot of unmovable+unswappable memory and even store
secrets in the page table structure.

Empty PTE tables is one such case we care about, but there is more. Even
with your approach, we can still end up with many page tables that are
allocated on higher levels (e.g., PMD tables) or page tables that are
not empty (especially, filled with the shared zeropage).

Ideally, we'd have some mechanism that can reclaim also other
reclaimable page tables (e.g., filled with shared zeropage). One idea
was to add reclaimable page tables to the LRU list and to then
scan+reclaim them on demand. There are multiple challenges involved,
obviously. One is how to synchronize against concurrent page table
walkers, another one is how to invalidate MMU notifiers from reclaim
context. It would most probably involve storing required information in
the memmap to be able to lock+synchronize.

Having that said, adding infrastructure that might not be easy to extend
to the more general case of reclaiming other reclaimable page tables on
multiple levels (esp PMD tables) might not be what we want. OTOH, it
gets the job done for once case we care about.

It's really hard to tell what to do because reclaiming page tables and
eventually handling malicious user space correctly is far from trivial :)

I'll be on vacation until end of May, I'll come back to this mail once
I'm back.

-- 
Thanks,

David / dhildenb


  reply	other threads:[~2022-05-18 14:51 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 01/18] x86/mm/encrypt: add the missing pte_unmap() call Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 02/18] percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync() returns Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 03/18] percpu_ref: make percpu_ref_switch_lock per percpu_ref Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 04/18] mm: convert to use ptep_clear() in pte_clear_not_present_full() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 05/18] mm: split the related definitions of pte_offset_map_lock() into pgtable.h Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 06/18] mm: introduce CONFIG_FREE_USER_PTE Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 07/18] mm: add pte_to_page() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 08/18] mm: introduce percpu_ref for user PTE page table page Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 09/18] pte_ref: add pte_tryget() and {__,}pte_put() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 10/18] mm: add pte_tryget_map{_lock}() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 11/18] mm: convert to use pte_tryget_map_lock() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 12/18] mm: convert to use pte_tryget_map() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 13/18] mm: add try_to_free_user_pte() helper Qi Zheng
2022-04-30 13:35   ` Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 14/18] mm: use try_to_free_user_pte() in MADV_DONTNEED case Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 15/18] mm: use try_to_free_user_pte() in MADV_FREE case Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 16/18] pte_ref: add track_pte_{set, clear}() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 17/18] x86/mm: add x86_64 support for pte_ref Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 18/18] Documentation: add document " Qi Zheng
2022-04-30 13:19   ` Bagas Sanjaya
2022-04-30 13:32     ` Qi Zheng
2022-05-17  8:30 ` [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
2022-05-18 14:51   ` David Hildenbrand [this message]
2022-05-18 14:56     ` Matthew Wilcox
2022-05-19  4:03       ` Qi Zheng
2022-05-19  3:58     ` Qi Zheng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=37055be1-05af-f7ef-c33e-27f90fa0f9ca@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dennis@kernel.org \
    --cc=jgg@nvidia.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ming.lei@redhat.com \
    --cc=songmuchun@bytedance.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=zhengqi.arch@bytedance.com \
    --cc=zhouchengming@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).