Re: [PATCH v2 0/9] Free user PTE page table pages

From: Qi Zheng <zhengqi.arch@bytedance.com>
To: David Hildenbrand <david@redhat.com>,
	akpm@linux-foundation.org, tglx@linutronix.de,
	hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com,
	kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com,
	jgg@ziepe.ca
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, songmuchun@bytedance.com
Subject: Re: [PATCH v2 0/9] Free user PTE page table pages
Date: Thu, 2 Sep 2021 11:37:49 +0800	[thread overview]
Message-ID: <1a6afa69-a955-9da7-1ff8-818bfcc7131e@bytedance.com> (raw)
In-Reply-To: <5b9348fc-95fe-5be2-e9df-7c906e0c9b81@redhat.com>

On 2021/9/1 PM8:32, David Hildenbrand wrote:
> On 19.08.21 05:18, Qi Zheng wrote:
>> Hi,
>>
>> This patch series aims to free user PTE page table pages when all PTE 
>> entries
>> are empty.
>>
>> The beginning of this story is that some malloc libraries(e.g. 
>> jemalloc or
>> tcmalloc) usually allocate the amount of VAs by mmap() and do not 
>> unmap those VAs.
>> They will use madvise(MADV_DONTNEED) to free physical memory if they 
>> want.
>> But the page tables do not be freed by madvise(), so it can produce many
>> page tables when the process touches an enormous virtual address space.
>>
>> The following figures are a memory usage snapshot of one process which 
>> actually
>> happened on our server:
>>
>>     VIRT:  55t
>>     RES:   590g
>>     VmPTE: 110g
>>
>> As we can see, the PTE page tables size is 110g, while the RES is 
>> 590g. In
>> theory, the process only need 1.2g PTE page tables to map those physical
>> memory. The reason why PTE page tables occupy a lot of memory is that
>> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
>> doesn't free the PTE page table pages. So we can free those empty PTE 
>> page
>> tables to save memory. In the above cases, we can save memory about 
>> 108g(best
>> case). And the larger the difference between the size of VIRT and RES, 
>> the
>> more memory we save.
>>
>> In this patch series, we add a pte_refcount field to the struct page 
>> of page
>> table to track how many users of PTE page table. Similar to the 
>> mechanism of
>> page refcount, the user of PTE page table should hold a refcount to it 
>> before
>> accessing. The PTE page table page will be freed when the last 
>> refcount is
>> dropped.
>>
>> Testing:
>>
>> The following code snippet can show the effect of optimization:
>>
>>     mmap 50G
>>     while (1) {
>>         for (; i < 1024 * 25; i++) {
>>             touch 2M memory
>>             madvise MADV_DONTNEED 2M
>>         }
>>     }
>>
>> As we can see, the memory usage of VmPTE is reduced:
>>
>>             before                        after
>> VIRT               50.0 GB                  50.0 GB
>> RES                3.1 MB                   3.6 MB
>> VmPTE             102640 kB                   248 kB
>>
>> I also have tested the stability by LTP[1] for several weeks. I have 
>> not seen
>> any crash so far.
>>
>> The performance of page fault can be affected because of the 
>> allocation/freeing
>> of PTE page table pages. The following is the test result by using a 
>> micro
>> benchmark[2]:
>>
>> root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:
>>
>> threads         before (pf/min)                     after (pf/min)
>>      1                32,085,255                         31,880,833 
>> (-0.64%)
>>      8               101,674,967                        100,588,311 
>> (-1.17%)
>>     16               113,207,000                        112,801,832 
>> (-0.36%)
>>
>> (The "pfn/min" means how many page faults in one minute.)
>>
>> The performance of page fault is ~1% slower than before.
>>
>> This series is based on next-20210812.
>>
>> Comments and suggestions are welcome.
>>
>> Thanks,
>> Qi.
>>
> 
> 
> Some high-level feedback after studying the code:
> 
> 1. Try introducing the new dummy primitives ("API") first, and then 
> convert each subsystem individually; especially, maybe convert the whole 
> pagefault handling in a single patch, because it's far from trivial. 
> This will make this series much easier to digest.
> 
> Then, have a patch that adds actual logic to the dummy primitives via a 
> config option.
> 
> 2. Minimize the API.
> 
> a) pte_alloc_get{,_map,_map_lock}() is really ugly. Maybe restrict it to 
> pte_alloc_get().
> 
> b) pmd_trans_unstable_or_pte_try_get() and friends are really ugly.
> 
> Handle it independently for now, even if it implies duplicate runtime 
> checks.
> 
> if (pmd_trans_unstable() || !pte_try_get()) ...
> 
> We can always optimize later, once we can come up with something cleaner.
> 
> 3. Merge #6, and #7, after factoring out all changes to other subsystems 
> to use the API
> 
> 4. Merge #8 into #6. There is a lot of unnecessary code churn back and 
> forth, and IMHO the whole approach might not make sense without RCU due 
> to the additional locking overhead.
> 
> Or at least, try to not modify the API you introduced in patch #6 or #7 
> in #8 again. Converting all call sites back and forth just makes review 
> quite hard.
> 

Very detailed feedback! Thank you very much for your time and energy,
I will seriously adopt and implement these modification suggestions.

> 
> I am preparing some some cleanups that will make get_locked_pte() and 
> similar a little nicer to handle. I'll send them out this or next week.
> 

Yup, we just simply convert the pte_alloc_map_lock() in
__get_locked_pte() to pte_alloc_get_map_lock(), and then
call the paired pte_put() in the caller of get_locked_pte().
Like the following pattern:

	insert_page
	--> get_locked_pte
	    --> __get_locked_pte
		--> pte_alloc_get_map_lock

	    "do some things"

	    pte_put

This is really ugly and hard to review.

I look forward to your cleanups, besides, can you outline your approach?

Thanks again,
Qi