From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, hannes@cmpxchg.org,
	mhocko@kernel.org, vdavydov.dev@gmail.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, songmuchun@bytedance.com, Qi Zheng
Subject: [PATCH 0/7] Free user PTE page table pages
Date: Sun, 18 Jul 2021 12:30:26 +0800
Message-Id: <20210718043034.76431-1-zhengqi.arch@bytedance.com>

Hi,

This patch series aims to free user PTE page table pages when all PTE
entries are empty.

The story begins with malloc libraries (e.g. jemalloc or tcmalloc) that
usually reserve large ranges of virtual address space with mmap() and
never unmap them. When they want to release physical memory, they call
madvise(MADV_DONTNEED). However, madvise() does not free the page tables,
so a process that touches an enormous virtual address space can
accumulate a large number of page tables.

The following figures are a memory usage snapshot of one process, taken
from one of our servers:

	VIRT:  55t
	RES:   590g
	VmPTE: 110g

As we can see, the PTE page tables occupy 110g, while the RES is 590g.
In theory, the process needs only about 1.2g of PTE page tables to map
that much physical memory.

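The 1.2g estimate can be reproduced with a back-of-the-envelope
calculation. The sketch below is not part of the series. It assumes 4 KiB
base pages and 8-byte PTEs (as on x86-64), neither of which is stated
above, and it assumes the resident pages are packed densely into PTE
tables, so the result is a lower bound. The helper name
min_pte_table_bytes() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions (not stated in the cover letter): 4 KiB pages, 8-byte PTEs. */
#define PAGE_SIZE_BYTES 4096ULL
#define PTE_SIZE_BYTES  8ULL

static_assert(PAGE_SIZE_BYTES % PTE_SIZE_BYTES == 0,
	      "a PTE page must hold a whole number of entries");

/*
 * Hypothetical helper: minimum number of bytes of PTE page tables needed
 * to map resident_bytes of memory, assuming the mapped pages are packed
 * densely so that every PTE page is fully used (best case).
 */
static uint64_t min_pte_table_bytes(uint64_t resident_bytes)
{
	/* Number of base pages that are resident, rounded up. */
	uint64_t pages = (resident_bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
	/* Entries per PTE page: 4096 / 8 = 512. */
	uint64_t ptes_per_table = PAGE_SIZE_BYTES / PTE_SIZE_BYTES;
	/* Number of PTE pages needed, rounded up. */
	uint64_t tables = (pages + ptes_per_table - 1) / ptes_per_table;

	return tables * PAGE_SIZE_BYTES;
}
```

Under these assumptions, min_pte_table_bytes(590ULL << 30) yields 302080
PTE pages, i.e. 1237319680 bytes or roughly 1.15g, which is consistent
with the ~1.2g figure above.
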
The reason PTE page tables occupy so much memory is that
madvise(MADV_DONTNEED) only clears the PTEs and frees the physical
memory; it does not free the PTE page table pages themselves. So we can
free those empty PTE page tables to save memory. In the case above, we
can save about 108g (best case). The larger the difference between VIRT
and RES, the more memory we save.

In this patch series, we add a pte_refcount field to the struct page of
a PTE page table to track how many users the page table has. Similar to
the page refcount mechanism, a user of a PTE page table must hold a
reference to it before accessing it. The PTE page table page is freed
when the last reference is dropped.

Testing:

The following code snippet shows the effect of the optimization:

	mmap 50G
	while (1) {
		for (; i < 1024 * 25; i++) {
			touch 2M memory
			madvise MADV_DONTNEED 2M
		}
	}

As we can see, the memory usage of VmPTE is reduced:

			before		after
	VIRT		50.0 GB		50.0 GB
	RES		3.1 MB		3.6 MB
	VmPTE		102640 kB	248 kB

I have also tested stability with LTP[1] for several weeks and have not
seen any crash so far.

Page fault performance can be affected by the allocation and freeing of
PTE page table pages. The following are the results of a micro
benchmark[2]:

	root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:

	threads		before (pf/min)		after (pf/min)
	1		 32,085,255		 31,880,833 (-0.64%)
	8		101,674,967		100,588,311 (-1.17%)
	16		113,207,000		112,801,832 (-0.36%)

("pf/min" is the number of page faults per minute.)

Page fault performance is ~1% slower than before.

This series is based on next-20210708.

Patch 1 is a bug fix.
Patches 2-4 are code simplifications.
Patch 5 frees user PTE page tables dynamically.
Patch 6 defers freeing PTE page tables for a grace period.
Patch 7 uses mmu_gather to free PTE page tables.

Comments and suggestions are welcome.

Thanks,
Qi.

[1] https://github.com/linux-test-project/ltp
[2] https://lore.kernel.org/patchwork/comment/296794/

Qi Zheng (7):
  mm: fix the deadlock in finish_fault()
  mm: introduce pte_install() helper
  mm: remove redundant smp_wmb()
  mm: rework the parameter of lock_page_or_retry()
  mm: free user PTE page table pages
  mm: defer freeing PTE page table for a grace period
  mm: use mmu_gather to free PTE page table

 Documentation/vm/split_page_table_lock.rst |   2 +-
 arch/arm/mm/pgd.c                          |   2 +-
 arch/arm64/mm/hugetlbpage.c                |   4 +-
 arch/ia64/mm/hugetlbpage.c                 |   2 +-
 arch/parisc/mm/hugetlbpage.c               |   2 +-
 arch/powerpc/mm/hugetlbpage.c              |   2 +-
 arch/s390/mm/gmap.c                        |   8 +-
 arch/s390/mm/pgtable.c                     |   6 +-
 arch/sh/mm/hugetlbpage.c                   |   2 +-
 arch/sparc/mm/hugetlbpage.c                |   2 +-
 arch/x86/Kconfig                           |   2 +-
 arch/x86/kernel/tboot.c                    |   2 +-
 fs/proc/task_mmu.c                         |  23 ++-
 fs/userfaultfd.c                           |   2 +
 include/linux/mm.h                         |  12 +-
 include/linux/mm_types.h                   |   8 +-
 include/linux/pagemap.h                    |   8 +-
 include/linux/pgtable.h                    |   3 +-
 include/linux/pte_ref.h                    | 241 +++++++++++++++++++++++++
 include/linux/rmap.h                       |   3 +
 kernel/events/uprobes.c                    |   3 +
 mm/Kconfig                                 |   4 +
 mm/Makefile                                |   3 +-
 mm/debug_vm_pgtable.c                      |   3 +-
 mm/filemap.c                               |  56 +++---
 mm/gup.c                                   |  10 +-
 mm/hmm.c                                   |   4 +
 mm/internal.h                              |   2 +
 mm/khugepaged.c                            |  10 ++
 mm/ksm.c                                   |   4 +
 mm/madvise.c                               |  20 ++-
 mm/memcontrol.c                            |  11 +-
 mm/memory.c                                | 279 +++++++++++++++++++----------
 mm/mempolicy.c                             |   5 +-
 mm/migrate.c                               |  21 ++-
 mm/mincore.c                               |   6 +-
 mm/mlock.c                                 |   1 +
 mm/mmu_gather.c                            |  40 ++---
 mm/mprotect.c                              |  10 +-
 mm/mremap.c                                |  12 +-
 mm/page_vma_mapped.c                       |   4 +
 mm/pagewalk.c                              |  19 +-
 mm/pgtable-generic.c                       |   2 +
 mm/pte_ref.c                               | 146 +++++++++++++++
 mm/rmap.c                                  |  13 +-
 mm/sparse-vmemmap.c                        |   2 +-
 mm/swapfile.c                              |   6 +-
 mm/userfaultfd.c                           |  15 +-
 48 files changed, 825 insertions(+), 222 deletions(-)
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

-- 
2.11.0