From: Qi Zheng <zhengqi.arch@bytedance.com>
To: akpm@linux-foundation.org, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com,
	david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, songmuchun@bytedance.com,
	zhouchengming@bytedance.com
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages
Date: Wed, 10 Nov 2021 16:52:07 +0800
Message-ID: <214410b7-01b6-9d01-b731-f8f1a3495d6b@bytedance.com>
In-Reply-To: <20211110084057.27676-1-zhengqi.arch@bytedance.com>

Hi all,

I’m sorry, something went wrong when sending this patch set. I will
resend the whole series later.

Thanks,
Qi

On 11/10/21 4:40 PM, Qi Zheng wrote:
> Hi,
> 
> This patch series aims to free user PTE page table pages when all PTE entries
> are empty.
> 
> The story begins with malloc libraries (e.g. jemalloc or tcmalloc) that
> usually reserve a large amount of VAs with mmap() and never unmap them.
> When they want to release physical memory, they call madvise(MADV_DONTNEED).
> However, madvise() does not free the page tables, so a process that touches
> an enormous virtual address space can accumulate many page tables.
> 
> The following figures are a memory usage snapshot of one process, observed
> on one of our servers:
> 
>          VIRT:  55t
>          RES:   590g
>          VmPTE: 110g
> 
> As we can see, the PTE page tables occupy 110g while the RES is 590g. In
> theory, the process only needs 1.2g of PTE page tables to map that physical
> memory. The PTE page tables occupy so much memory because
> madvise(MADV_DONTNEED) only empties the PTEs and frees the physical memory;
> it does not free the PTE page table pages. Freeing those empty PTE page
> tables therefore saves memory: about 108g in the case above (best case).
> The larger the difference between VIRT and RES, the more memory we save.
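
The 1.2g estimate above can be reproduced with a short calculation. The
sketch below is only an illustration: it assumes an x86-64 style layout
(4 KiB pages, 8-byte PTEs, so 512 entries per PTE page table page), which the
cover letter does not state, and min_pte_table_bytes() is a hypothetical
helper name, not something from the series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout (not stated in the cover letter): 4 KiB pages and 8-byte
 * PTEs, so one PTE page table page holds 512 entries and maps 2 MiB.
 */
#define PAGE_SIZE_BYTES  4096ULL
#define PTE_SIZE_BYTES   8ULL
#define PTES_PER_TABLE   (PAGE_SIZE_BYTES / PTE_SIZE_BYTES)

/*
 * Minimum bytes of PTE page tables needed to map `mapped_bytes` of memory,
 * assuming the mapped pages are packed into as few tables as possible.
 */
static uint64_t min_pte_table_bytes(uint64_t mapped_bytes)
{
	/* Guard against overflow in the round-up below. */
	assert(mapped_bytes <= UINT64_MAX - PAGE_SIZE_BYTES);

	uint64_t pages  = (mapped_bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
	uint64_t tables = (pages + PTES_PER_TABLE - 1) / PTES_PER_TABLE;

	return tables * PAGE_SIZE_BYTES;
}
```

Under these assumptions, min_pte_table_bytes(590ULL << 30) yields
1,237,319,680 bytes, roughly 1.15 GiB, which matches the ~1.2g figure; the
difference from the observed 110g gives the ~108g best-case saving.
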
> 
> In this patch series, we add a pte_refcount field to the struct page of a
> page table to track how many users the PTE page table has. Similar to the
> page refcount mechanism, a user of a PTE page table must hold a reference
> to it before accessing it. The PTE page table page is freed when the last
> reference is dropped.
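
The get/put pattern described above can be sketched in userspace with C11
atomics. This is a minimal illustration, not the series' implementation:
struct pt_page and the pt_page_*() helpers are hypothetical names, and only
the "get unless zero" idea echoes the pte_get_unless_zero symbol that shows
up in the perf output further down.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a PTE page table page. */
struct pt_page {
	atomic_int refcount;	/* number of current users */
};

/* Allocate a table with one initial reference held by the caller. */
static struct pt_page *pt_page_alloc(void)
{
	struct pt_page *pt = malloc(sizeof(*pt));

	if (pt)
		atomic_init(&pt->refcount, 1);
	return pt;
}

/* Take a reference only if the table is still live (refcount != 0). */
static bool pt_page_get_unless_zero(struct pt_page *pt)
{
	int old = atomic_load(&pt->refcount);

	while (old != 0) {
		/* On failure, `old` is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&pt->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if it was the last and the page was freed. */
static bool pt_page_put(struct pt_page *pt)
{
	int old = atomic_fetch_sub(&pt->refcount, 1);

	assert(old > 0);
	if (old == 1) {
		free(pt);
		return true;
	}
	return false;
}
```

A walker would call pt_page_get_unless_zero() before touching the table and
pt_page_put() afterwards; a failed get means the table is already on its way
to being freed and must not be accessed.
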
> 
> Testing:
> 
> The following code snippet shows the effect of the optimization:
> 
>          mmap 50G
>          while (1) {
>                  for (i = 0; i < 1024 * 25; i++) {
>                          touch 2M memory
>                          madvise MADV_DONTNEED 2M
>                  }
>          }
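
For anyone who wants to try this, one pass of the loop above could look
roughly like the following on Linux. It is a sketch under assumptions: the
function name touch_and_discard() is hypothetical, MAP_NORESERVE is my
addition so that a large mapping does not depend on overcommit settings, and
the real test program may differ.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Map `total` bytes, then touch and discard it `chunk` bytes at a time.
 * Returns the number of chunks processed, or -1 if mmap()/madvise() fails.
 */
static long touch_and_discard(size_t total, size_t chunk)
{
	assert(chunk > 0 && total % chunk == 0);

	char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	long done = 0;
	for (size_t off = 0; off < total; off += chunk) {
		/* Touch: writing faults the pages (and PTE tables) in. */
		memset(base + off, 1, chunk);
		/* Discard: frees the physical pages but keeps the mapping. */
		if (madvise(base + off, chunk, MADV_DONTNEED) != 0) {
			munmap(base, total);
			return -1;
		}
		done++;
	}

	munmap(base, total);
	return done;
}
```

Calling it with a 50 GiB total and a 2 MiB chunk corresponds to the
1024 * 25 iterations of the snippet. To watch VmPTE in /proc/<pid>/status
while it runs, the mapping would have to be kept and the loop repeated, as
the while (1) in the original does.
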
> 
> As we can see, the memory usage of VmPTE is reduced:
> 
>                          before                          after
> VIRT                   50.0 GB                        50.0 GB
> RES                     3.1 MB                         3.6 MB
> VmPTE                102640 kB                         248 kB
> 
> I have also tested stability with LTP[1] for several weeks and have not
> seen any crashes so far.
> 
> Page fault performance can be affected by the allocation/freeing of PTE
> page table pages. The following are the results of a micro benchmark[2]:
> 
> root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:
> 
> threads         before (pf/min)                     after (pf/min)
>      1                32,085,255                         31,880,833 (-0.64%)
>      8               101,674,967                        100,588,311 (-1.17%)
>     16               113,207,000                        112,801,832 (-0.36%)
> 
> ("pf/min" is the number of page faults per minute.)
> 
> Page fault handling is ~1% slower than before.
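
The percentages in the table are relative changes in the pf/min counts. A
small helper (percent_change() is a hypothetical name used only for this
illustration) makes the arithmetic explicit:

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent (negative = slower). */
static double percent_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

With the counts as listed, the 1-thread and 16-thread rows come out at about
-0.64% and -0.36%. The 8-thread row computes to about -1.07%, slightly less
than the quoted -1.17%; either value is consistent with the ~1% summary.
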
> 
> And there are no obvious changes in perf hot spots:
> 
> before:
>    19.29%  [kernel]  [k] clear_page_rep
>    16.12%  [kernel]  [k] do_user_addr_fault
>     9.57%  [kernel]  [k] _raw_spin_unlock_irqrestore
>     6.16%  [kernel]  [k] get_page_from_freelist
>     5.03%  [kernel]  [k] __handle_mm_fault
>     3.53%  [kernel]  [k] __rcu_read_unlock
>     3.45%  [kernel]  [k] handle_mm_fault
>     3.38%  [kernel]  [k] down_read_trylock
>     2.74%  [kernel]  [k] free_unref_page_list
>     2.17%  [kernel]  [k] up_read
>     1.93%  [kernel]  [k] charge_memcg
>     1.73%  [kernel]  [k] try_charge_memcg
>     1.71%  [kernel]  [k] __alloc_pages
>     1.69%  [kernel]  [k] ___perf_sw_event
>     1.44%  [kernel]  [k] get_mem_cgroup_from_mm
> 
> after:
>    18.19%  [kernel]  [k] clear_page_rep
>    16.28%  [kernel]  [k] do_user_addr_fault
>     8.39%  [kernel]  [k] _raw_spin_unlock_irqrestore
>     5.12%  [kernel]  [k] get_page_from_freelist
>     4.81%  [kernel]  [k] __handle_mm_fault
>     4.68%  [kernel]  [k] down_read_trylock
>     3.80%  [kernel]  [k] handle_mm_fault
>     3.59%  [kernel]  [k] get_mem_cgroup_from_mm
>     2.49%  [kernel]  [k] free_unref_page_list
>     2.41%  [kernel]  [k] up_read
>     2.16%  [kernel]  [k] charge_memcg
>     1.92%  [kernel]  [k] __rcu_read_unlock
>     1.88%  [kernel]  [k] ___perf_sw_event
>     1.70%  [kernel]  [k] pte_get_unless_zero
> 
> This series is based on next-20211108.
> 
> Comments and suggestions are welcome.
> 
> Thanks,
> Qi.
> 
> [1] https://github.com/linux-test-project/ltp
> [2] https://lore.kernel.org/lkml/20100106160614.ff756f82.kamezawa.hiroyu@jp.fujitsu.com/2-multi-fault-all.c
> 
> Changelog in v2 -> v3:
>   - Refactored this patch series:
>          - [PATCH v3 6/15]: Introduce the new dummy helpers first
>          - [PATCH v3 7-12/15]: Convert each subsystem individually
>          - [PATCH v3 13/15]: Implement the actual logic in the dummy helpers
>     Thanks to David and Jason for the advice.
>   - Add a document.
> 
> Changelog in v1 -> v2:
>   - Change pte_install() to pmd_install().
>   - Fix some typos and code style problems.
>   - Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9], [PATCH v2 6/9]
>     and [PATCH v2 7/9].
> 
> Qi Zheng (15):
>    mm: do code cleanups to filemap_map_pmd()
>    mm: introduce is_huge_pmd() helper
>    mm: move pte_offset_map_lock() to pgtable.h
>    mm: rework the parameter of lock_page_or_retry()
>    mm: add pmd_installed_type return for __pte_alloc() and other friends
>    mm: introduce refcount for user PTE page table page
>    mm/pte_ref: add support for user PTE page table page allocation
>    mm/pte_ref: initialize the refcount of the withdrawn PTE page table
>      page
>    mm/pte_ref: add support for the map/unmap of user PTE page table page
>    mm/pte_ref: add support for page fault path
>    mm/pte_ref: take a refcount before accessing the PTE page table page
>    mm/pte_ref: update the pmd entry in move_normal_pmd()
>    mm/pte_ref: free user PTE page table pages
>    Documentation: add document for pte_ref
>    mm/pte_ref: use mmu_gather to free PTE page table pages
> 
>   Documentation/vm/pte_ref.rst | 216 ++++++++++++++++++++++++++++++++++++
>   arch/x86/Kconfig             |   2 +-
>   fs/proc/task_mmu.c           |  24 +++-
>   fs/userfaultfd.c             |   9 +-
>   include/linux/huge_mm.h      |  10 +-
>   include/linux/mm.h           | 170 ++++-------------------------
>   include/linux/mm_types.h     |   6 +-
>   include/linux/pagemap.h      |   8 +-
>   include/linux/pgtable.h      | 152 +++++++++++++++++++++++++-
>   include/linux/pte_ref.h      | 146 +++++++++++++++++++++++++
>   include/linux/rmap.h         |   2 +
>   kernel/events/uprobes.c      |   2 +
>   mm/Kconfig                   |   4 +
>   mm/Makefile                  |   4 +-
>   mm/damon/vaddr.c             |  12 +-
>   mm/debug_vm_pgtable.c        |   5 +-
>   mm/filemap.c                 |  45 +++++---
>   mm/gup.c                     |  25 ++++-
>   mm/hmm.c                     |   5 +-
>   mm/huge_memory.c             |   3 +-
>   mm/internal.h                |   4 +-
>   mm/khugepaged.c              |  21 +++-
>   mm/ksm.c                     |   6 +-
>   mm/madvise.c                 |  21 +++-
>   mm/memcontrol.c              |  12 +-
>   mm/memory-failure.c          |  11 +-
>   mm/memory.c                  | 254 ++++++++++++++++++++++++++++++++-----------
>   mm/mempolicy.c               |   6 +-
>   mm/migrate.c                 |  54 ++++-----
>   mm/mincore.c                 |   7 +-
>   mm/mlock.c                   |   1 +
>   mm/mmu_gather.c              |  40 +++----
>   mm/mprotect.c                |  11 +-
>   mm/mremap.c                  |  14 ++-
>   mm/page_vma_mapped.c         |   4 +
>   mm/pagewalk.c                |  15 ++-
>   mm/pgtable-generic.c         |   1 +
>   mm/pte_ref.c                 | 141 ++++++++++++++++++++++++
>   mm/rmap.c                    |  10 ++
>   mm/swapfile.c                |   3 +
>   mm/userfaultfd.c             |  40 +++++--
>   41 files changed, 1186 insertions(+), 340 deletions(-)
>   create mode 100644 Documentation/vm/pte_ref.rst
>   create mode 100644 include/linux/pte_ref.h
>   create mode 100644 mm/pte_ref.c
> 

Thread overview: 36+ messages
2021-11-10  8:40 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
2021-11-10  8:40 ` [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd() Qi Zheng
2021-11-10  8:40 ` [PATCH v3 02/15] mm: introduce is_huge_pmd() helper Qi Zheng
2021-11-10 12:29   ` Jason Gunthorpe
2021-11-10 12:58     ` Qi Zheng
2021-11-10 12:59       ` Jason Gunthorpe
2021-11-10  8:40 ` [PATCH v3 03/15] mm: move pte_offset_map_lock() to pgtable.h Qi Zheng
2021-11-10  8:40 ` [PATCH v3 04/15] mm: rework the parameter of lock_page_or_retry() Qi Zheng
2021-11-10  8:40 ` [PATCH v3 05/15] mm: add pmd_installed_type return for __pte_alloc() and other friends Qi Zheng
2021-11-10  8:40 ` [PATCH v3 06/15] mm: introduce refcount for user PTE page table page Qi Zheng
2021-11-10  8:40 ` [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation Qi Zheng
2021-11-10  8:40 ` [PATCH v3 08/15] mm/pte_ref: initialize the refcount of the withdrawn PTE page table page Qi Zheng
2021-11-10  8:40 ` [PATCH v3 09/15] mm/pte_ref: add support for the map/unmap of user " Qi Zheng
2021-11-10  8:52 ` Qi Zheng [this message]
2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
2021-11-10 12:56 ` Jason Gunthorpe
2021-11-10 13:25   ` David Hildenbrand
2021-11-10 13:59     ` Qi Zheng
2021-11-10 14:38     ` Jason Gunthorpe
2021-11-10 15:37       ` David Hildenbrand
2021-11-10 16:39         ` Jason Gunthorpe
2021-11-10 17:37           ` David Hildenbrand
2021-11-10 17:49             ` Jason Gunthorpe
2021-11-11  3:58             ` Qi Zheng
2021-11-11  9:22               ` David Hildenbrand
2021-11-11 11:08                 ` Qi Zheng
2021-11-11 11:19                   ` David Hildenbrand
2021-11-11 12:00                     ` Qi Zheng
2021-11-11 12:20                       ` David Hildenbrand
2021-11-11 12:32                         ` Qi Zheng
2021-11-11 12:51                           ` David Hildenbrand
2021-11-11 13:01                             ` Qi Zheng
2021-11-10 16:49         ` Matthew Wilcox
2021-11-10 16:53           ` David Hildenbrand
2021-11-10 16:56             ` Jason Gunthorpe
2021-11-10 13:54   ` Qi Zheng
