From: Qi Zheng <zhengqi.arch@bytedance.com>
To: David Hildenbrand <david@redhat.com>,
akpm@linux-foundation.org, tglx@linutronix.de,
hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com,
kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com,
jgg@ziepe.ca
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, songmuchun@bytedance.com
Subject: Re: [PATCH v2 0/9] Free user PTE page table pages
Date: Thu, 2 Sep 2021 11:37:49 +0800 [thread overview]
Message-ID: <1a6afa69-a955-9da7-1ff8-818bfcc7131e@bytedance.com> (raw)
In-Reply-To: <5b9348fc-95fe-5be2-e9df-7c906e0c9b81@redhat.com>
On 2021/9/1 PM8:32, David Hildenbrand wrote:
> On 19.08.21 05:18, Qi Zheng wrote:
>> Hi,
>>
>> This patch series aims to free user PTE page table pages when all PTE
>> entries
>> are empty.
>>
>> The beginning of this story is that some malloc libraries(e.g.
>> jemalloc or
>> tcmalloc) usually allocate the amount of VAs by mmap() and do not
>> unmap those VAs.
>> They will use madvise(MADV_DONTNEED) to free physical memory if they
>> want.
>> But the page tables do not be freed by madvise(), so it can produce many
>> page tables when the process touches an enormous virtual address space.
>>
>> The following figures are a memory usage snapshot of one process which
>> actually
>> happened on our server:
>>
>> VIRT: 55t
>> RES: 590g
>> VmPTE: 110g
>>
>> As we can see, the PTE page tables size is 110g, while the RES is
>> 590g. In
>> theory, the process only need 1.2g PTE page tables to map those physical
>> memory. The reason why PTE page tables occupy a lot of memory is that
>> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
>> doesn't free the PTE page table pages. So we can free those empty PTE
>> page
>> tables to save memory. In the above cases, we can save memory about
>> 108g(best
>> case). And the larger the difference between the size of VIRT and RES,
>> the
>> more memory we save.
>>
>> In this patch series, we add a pte_refcount field to the struct page
>> of page
>> table to track how many users of PTE page table. Similar to the
>> mechanism of
>> page refcount, the user of PTE page table should hold a refcount to it
>> before
>> accessing. The PTE page table page will be freed when the last
>> refcount is
>> dropped.
>>
>> Testing:
>>
>> The following code snippet can show the effect of optimization:
>>
>> mmap 50G
>> while (1) {
>> for (; i < 1024 * 25; i++) {
>> touch 2M memory
>> madvise MADV_DONTNEED 2M
>> }
>> }
>>
>> As we can see, the memory usage of VmPTE is reduced:
>>
>> before after
>> VIRT 50.0 GB 50.0 GB
>> RES 3.1 MB 3.6 MB
>> VmPTE 102640 kB 248 kB
>>
>> I also have tested the stability by LTP[1] for several weeks. I have
>> not seen
>> any crash so far.
>>
>> The performance of page fault can be affected because of the
>> allocation/freeing
>> of PTE page table pages. The following is the test result by using a
>> micro
>> benchmark[2]:
>>
>> root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:
>>
>> threads before (pf/min) after (pf/min)
>> 1 32,085,255 31,880,833
>> (-0.64%)
>> 8 101,674,967 100,588,311
>> (-1.17%)
>> 16 113,207,000 112,801,832
>> (-0.36%)
>>
>> (The "pfn/min" means how many page faults in one minute.)
>>
>> The performance of page fault is ~1% slower than before.
>>
>> This series is based on next-20210812.
>>
>> Comments and suggestions are welcome.
>>
>> Thanks,
>> Qi.
>>
>
>
> Some high-level feedback after studying the code:
>
> 1. Try introducing the new dummy primitives ("API") first, and then
> convert each subsystem individually; especially, maybe convert the whole
> pagefault handling in a single patch, because it's far from trivial.
> This will make this series much easier to digest.
>
> Then, have a patch that adds actual logic to the dummy primitives via a
> config option.
>
> 2. Minimize the API.
>
> a) pte_alloc_get{,_map,_map_lock}() is really ugly. Maybe restrict it to
> pte_alloc_get().
>
> b) pmd_trans_unstable_or_pte_try_get() and friends are really ugly.
>
> Handle it independently for now, even if it implies duplicate runtime
> checks.
>
> if (pmd_trans_unstable() || !pte_try_get()) ...
>
> We can always optimize later, once we can come up with something cleaner.
>
> 3. Merge #6, and #7, after factoring out all changes to other subsystems
> to use the API
>
> 4. Merge #8 into #6. There is a lot of unnecessary code churn back and
> forth, and IMHO the whole approach might not make sense without RCU due
> to the additional locking overhead.
>
> Or at least, try to not modify the API you introduced in patch #6 or #7
> in #8 again. Converting all call sites back and forth just makes review
> quite hard.
>
Very detailed feedback! Thank you very much for your time and energy,
I will seriously adopt and implement these modification suggestions.
>
> I am preparing some some cleanups that will make get_locked_pte() and
> similar a little nicer to handle. I'll send them out this or next week.
>
Yup, we just simply convert the pte_alloc_map_lock() in
__get_locked_pte() to pte_alloc_get_map_lock(), and then
call the paired pte_put() in the caller of get_locked_pte().
Like the following pattern:
insert_page
--> get_locked_pte
--> __get_locked_pte
--> pte_alloc_get_map_lock
"do some things"
pte_put
This is really ugly and hard to review.
I look forward to your cleanups, besides, can you outline your approach?
Thanks again,
Qi
next prev parent reply other threads:[~2021-09-02 3:38 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-08-19 3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
2021-08-19 3:18 ` [PATCH v2 1/9] mm: introduce pmd_install() helper Qi Zheng
2021-08-24 16:26 ` David Hildenbrand
2021-08-25 16:20 ` Qi Zheng
2021-08-25 16:32 ` David Hildenbrand
2021-08-26 3:04 ` Qi Zheng
2021-08-19 3:18 ` [PATCH v2 2/9] mm: remove redundant smp_wmb() Qi Zheng
2021-08-19 3:18 ` [PATCH v2 3/9] mm: rework the parameter of lock_page_or_retry() Qi Zheng
2021-08-19 3:18 ` [PATCH v2 4/9] mm: move pte_alloc{,_map,_map_lock}() to a separate file Qi Zheng
2021-08-19 3:18 ` [PATCH v2 5/9] mm: pte_refcount infrastructure Qi Zheng
2021-08-19 3:18 ` [PATCH v2 6/9] mm: free user PTE page table pages Qi Zheng
2021-08-19 7:01 ` David Hildenbrand
2021-08-19 10:18 ` [External] " Qi Zheng
2021-09-01 13:53 ` Jason Gunthorpe
2021-09-01 13:57 ` David Hildenbrand
2021-09-01 15:32 ` Jason Gunthorpe
2021-09-01 16:13 ` David Hildenbrand
2021-09-01 16:16 ` Jason Gunthorpe
2021-09-01 16:19 ` David Hildenbrand
2021-09-01 17:10 ` Jason Gunthorpe
2021-09-01 17:49 ` David Hildenbrand
2021-09-01 17:55 ` Jason Gunthorpe
2021-09-01 17:58 ` David Hildenbrand
2021-09-01 18:09 ` Jason Gunthorpe
2021-09-01 18:10 ` David Hildenbrand
2021-09-02 7:04 ` Qi Zheng
2021-09-02 6:53 ` Qi Zheng
2021-08-19 3:18 ` [PATCH v2 7/9] mm: add THP support for pte_ref Qi Zheng
2021-08-19 3:18 ` [PATCH v2 8/9] mm: free PTE page table by using rcu mechanism Qi Zheng
2021-08-19 3:18 ` [PATCH v2 9/9] mm: use mmu_gather to free PTE page table Qi Zheng
2021-09-01 12:32 ` [PATCH v2 0/9] Free user PTE page table pages David Hildenbrand
2021-09-01 16:07 ` Jason Gunthorpe
2021-09-01 16:10 ` David Hildenbrand
2021-09-02 3:37 ` Qi Zheng [this message]
2021-09-15 14:52 ` Qi Zheng
2021-09-15 14:59 ` Jason Gunthorpe
2021-09-16 5:32 ` Qi Zheng
2021-09-16 8:30 ` David Hildenbrand
2021-09-16 8:41 ` Qi Zheng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1a6afa69-a955-9da7-1ff8-818bfcc7131e@bytedance.com \
--to=zhengqi.arch@bytedance.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=hannes@cmpxchg.org \
--cc=jgg@ziepe.ca \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=mika.penttila@nextfour.com \
--cc=songmuchun@bytedance.com \
--cc=tglx@linutronix.de \
--cc=vdavydov.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).