linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: David Hildenbrand <david@redhat.com>,
	akpm@linux-foundation.org, tglx@linutronix.de,
	hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com,
	kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com,
	jgg@ziepe.ca
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, songmuchun@bytedance.com
Subject: Re: [PATCH v2 0/9] Free user PTE page table pages
Date: Thu, 2 Sep 2021 11:37:49 +0800	[thread overview]
Message-ID: <1a6afa69-a955-9da7-1ff8-818bfcc7131e@bytedance.com> (raw)
In-Reply-To: <5b9348fc-95fe-5be2-e9df-7c906e0c9b81@redhat.com>



On 2021/9/1 PM8:32, David Hildenbrand wrote:
> On 19.08.21 05:18, Qi Zheng wrote:
>> Hi,
>>
>> This patch series aims to free user PTE page table pages when all PTE 
>> entries
>> are empty.
>>
>> The beginning of this story is that some malloc libraries(e.g. 
>> jemalloc or
>> tcmalloc) usually allocate the amount of VAs by mmap() and do not 
>> unmap those VAs.
>> They will use madvise(MADV_DONTNEED) to free physical memory if they 
>> want.
>> But the page tables do not be freed by madvise(), so it can produce many
>> page tables when the process touches an enormous virtual address space.
>>
>> The following figures are a memory usage snapshot of one process which 
>> actually
>> happened on our server:
>>
>>     VIRT:  55t
>>     RES:   590g
>>     VmPTE: 110g
>>
>> As we can see, the PTE page tables size is 110g, while the RES is 
>> 590g. In
>> theory, the process only need 1.2g PTE page tables to map those physical
>> memory. The reason why PTE page tables occupy a lot of memory is that
>> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
>> doesn't free the PTE page table pages. So we can free those empty PTE 
>> page
>> tables to save memory. In the above cases, we can save memory about 
>> 108g(best
>> case). And the larger the difference between the size of VIRT and RES, 
>> the
>> more memory we save.
>>
>> In this patch series, we add a pte_refcount field to the struct page 
>> of page
>> table to track how many users of PTE page table. Similar to the 
>> mechanism of
>> page refcount, the user of PTE page table should hold a refcount to it 
>> before
>> accessing. The PTE page table page will be freed when the last 
>> refcount is
>> dropped.
>>
>> Testing:
>>
>> The following code snippet can show the effect of optimization:
>>
>>     mmap 50G
>>     while (1) {
>>         for (; i < 1024 * 25; i++) {
>>             touch 2M memory
>>             madvise MADV_DONTNEED 2M
>>         }
>>     }
>>
>> As we can see, the memory usage of VmPTE is reduced:
>>
>>             before                        after
>> VIRT               50.0 GB                  50.0 GB
>> RES                3.1 MB                   3.6 MB
>> VmPTE             102640 kB                   248 kB
>>
>> I also have tested the stability by LTP[1] for several weeks. I have 
>> not seen
>> any crash so far.
>>
>> The performance of page fault can be affected because of the 
>> allocation/freeing
>> of PTE page table pages. The following is the test result by using a 
>> micro
>> benchmark[2]:
>>
>> root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:
>>
>> threads         before (pf/min)                     after (pf/min)
>>      1                32,085,255                         31,880,833 
>> (-0.64%)
>>      8               101,674,967                        100,588,311 
>> (-1.17%)
>>     16               113,207,000                        112,801,832 
>> (-0.36%)
>>
>> (The "pfn/min" means how many page faults in one minute.)
>>
>> The performance of page fault is ~1% slower than before.
>>
>> This series is based on next-20210812.
>>
>> Comments and suggestions are welcome.
>>
>> Thanks,
>> Qi.
>>
> 
> 
> Some high-level feedback after studying the code:
> 
> 1. Try introducing the new dummy primitives ("API") first, and then 
> convert each subsystem individually; especially, maybe convert the whole 
> pagefault handling in a single patch, because it's far from trivial. 
> This will make this series much easier to digest.
> 
> Then, have a patch that adds actual logic to the dummy primitives via a 
> config option.
> 
> 2. Minimize the API.
> 
> a) pte_alloc_get{,_map,_map_lock}() is really ugly. Maybe restrict it to 
> pte_alloc_get().
> 
> b) pmd_trans_unstable_or_pte_try_get() and friends are really ugly.
> 
> Handle it independently for now, even if it implies duplicate runtime 
> checks.
> 
> if (pmd_trans_unstable() || !pte_try_get()) ...
> 
> We can always optimize later, once we can come up with something cleaner.
> 
> 3. Merge #6, and #7, after factoring out all changes to other subsystems 
> to use the API
> 
> 4. Merge #8 into #6. There is a lot of unnecessary code churn back and 
> forth, and IMHO the whole approach might not make sense without RCU due 
> to the additional locking overhead.
> 
> Or at least, try to not modify the API you introduced in patch #6 or #7 
> in #8 again. Converting all call sites back and forth just makes review 
> quite hard.
> 

Very detailed feedback! Thank you very much for your time and energy,
I will seriously adopt and implement these modification suggestions.

> 
> I am preparing some some cleanups that will make get_locked_pte() and 
> similar a little nicer to handle. I'll send them out this or next week.
> 

Yup, we just simply convert the pte_alloc_map_lock() in
__get_locked_pte() to pte_alloc_get_map_lock(), and then
call the paired pte_put() in the caller of get_locked_pte().
Like the following pattern:

	insert_page
	--> get_locked_pte
	    --> __get_locked_pte
		--> pte_alloc_get_map_lock

	    "do some things"

	    pte_put

This is really ugly and hard to review.

I look forward to your cleanups, besides, can you outline your approach?

Thanks again,
Qi

  parent reply	other threads:[~2021-09-02  3:38 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-19  3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
2021-08-19  3:18 ` [PATCH v2 1/9] mm: introduce pmd_install() helper Qi Zheng
2021-08-24 16:26   ` David Hildenbrand
2021-08-25 16:20     ` Qi Zheng
2021-08-25 16:32       ` David Hildenbrand
2021-08-26  3:04         ` Qi Zheng
2021-08-19  3:18 ` [PATCH v2 2/9] mm: remove redundant smp_wmb() Qi Zheng
2021-08-19  3:18 ` [PATCH v2 3/9] mm: rework the parameter of lock_page_or_retry() Qi Zheng
2021-08-19  3:18 ` [PATCH v2 4/9] mm: move pte_alloc{,_map,_map_lock}() to a separate file Qi Zheng
2021-08-19  3:18 ` [PATCH v2 5/9] mm: pte_refcount infrastructure Qi Zheng
2021-08-19  3:18 ` [PATCH v2 6/9] mm: free user PTE page table pages Qi Zheng
2021-08-19  7:01   ` David Hildenbrand
2021-08-19 10:18     ` [External] " Qi Zheng
2021-09-01 13:53   ` Jason Gunthorpe
2021-09-01 13:57     ` David Hildenbrand
2021-09-01 15:32       ` Jason Gunthorpe
2021-09-01 16:13         ` David Hildenbrand
2021-09-01 16:16           ` Jason Gunthorpe
2021-09-01 16:19             ` David Hildenbrand
2021-09-01 17:10               ` Jason Gunthorpe
2021-09-01 17:49                 ` David Hildenbrand
2021-09-01 17:55                   ` Jason Gunthorpe
2021-09-01 17:58                     ` David Hildenbrand
2021-09-01 18:09                       ` Jason Gunthorpe
2021-09-01 18:10                         ` David Hildenbrand
2021-09-02  7:04                     ` Qi Zheng
2021-09-02  6:53         ` Qi Zheng
2021-08-19  3:18 ` [PATCH v2 7/9] mm: add THP support for pte_ref Qi Zheng
2021-08-19  3:18 ` [PATCH v2 8/9] mm: free PTE page table by using rcu mechanism Qi Zheng
2021-08-19  3:18 ` [PATCH v2 9/9] mm: use mmu_gather to free PTE page table Qi Zheng
2021-09-01 12:32 ` [PATCH v2 0/9] Free user PTE page table pages David Hildenbrand
2021-09-01 16:07   ` Jason Gunthorpe
2021-09-01 16:10     ` David Hildenbrand
2021-09-02  3:37   ` Qi Zheng [this message]
2021-09-15 14:52   ` Qi Zheng
2021-09-15 14:59     ` Jason Gunthorpe
2021-09-16  5:32       ` Qi Zheng
2021-09-16  8:30         ` David Hildenbrand
2021-09-16  8:41           ` Qi Zheng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1a6afa69-a955-9da7-1ff8-818bfcc7131e@bytedance.com \
    --to=zhengqi.arch@bytedance.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=jgg@ziepe.ca \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=mika.penttila@nextfour.com \
    --cc=songmuchun@bytedance.com \
    --cc=tglx@linutronix.de \
    --cc=vdavydov.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).