From: Zi Yan
To: linux-mm@kvack.org
Cc: "Kirill A . Shutemov", Roman Gushchin, Rik van Riel, Matthew Wilcox,
    Shakeel Butt, Yang Shi, Jason Gunthorpe, Mike Kravetz, Michal Hocko,
    David Hildenbrand, William Kucharski, Andrea Arcangeli, John Hubbard,
    David Nellans, linux-kernel@vger.kernel.org, Zi Yan
Subject: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
Date: Mon, 28 Sep 2020 13:53:58 -0400
Message-Id: <20200928175428.4110504-1-zi.yan@sent.com>

Hi all,

This patchset adds support for 1GB PUD THP on x86_64. It is on top of
v5.9-rc5-mmots-2020-09-18-21-23.
It is also available at:
https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23

Other than PUD THP, we had some discussion on generating THPs and contiguous
physical memory via a synchronous system call [0]. I am planning to send out
a separate patchset on it later, since I feel that it can be done
independently of PUD THP support.

Any comment or suggestion is welcome. Thanks.

Motivation
====
The patchset is trying to provide a more transparent way of boosting virtual
memory performance by leveraging gigantic TLB entries compared to hugetlbfs
pages [1,2]. Roman also said he would provide performance numbers of using
1GB PUD THP once the patchset is in relatively good shape [1].

Patchset organization:
====

1. Patch 1 and 2: Jason's PUD entry READ_ONCE patch to walk_page_range to
   give a consistent read of PUD entries during lockless page table walks.
   I also add a PMD entry READ_ONCE patch, since PMD level walk_page_range
   has the same lockless behavior as PUD level.

2. Patch 3: THP page table deposit now uses a singly linked list to enable
   hierarchical page table deposit, i.e., deposit a PMD page where 512 PTE
   pages are deposited to. Every page table page has a deposit_head and a
   deposit_node. For example, when storing 512 PTE pages to a PMD page, the
   PMD page's deposit_head links to a PTE page's deposit_node, which links
   to another PTE page's deposit_node.

3. Patch 4,5,6: helper functions for allocating page table pages for PUD
   THPs and change thp_order and thp_nr.

4. Patch 7 to 23: PUD THP implementation. It is broken into small patches
   for easy review.

5. Patch 24, 25: new page size encoding for MADV_HUGEPAGE and
   MADV_NOHUGEPAGE in madvise. User can specify THP size. Only
   MADV_HUGEPAGE_1GB is accepted. VM_HUGEPAGE_PUD is added to vm_flags to
   store the information at bit 37. You are welcome to suggest any other
   approach.

6. Patch 26, 27: enable_pud_thp and hpage_pud_size are added to
   /sys/kernel/mm/transparent_hugepage/. enable_pud_thp is set to never by
   default.

7. Patch 28, 29: PUD THPs are allocated only from boot-time reserved CMA
   regions. The CMA regions can be used for other movable page allocations.

Design for PUD-, PMD-, and PTE-mapped PUD THP
====

One additional design compared to PMD THP is the support for PMD-mapped PUD
THP, since the original THP design supports PUD-mapped and PTE-mapped PUD THP
automatically.

PMD mapcounts are stored at (512*N + 3) subpages (N = 0 to 511) and 512*N
subpages are called PMDPageInPUD. A PUDDoubleMap bit is stored at the third
subpage of a PUD THP, using the same page flag position as DoubleMap (stored
at the second subpage of a PMD THP), to indicate a PUD THP with both PUD and
PMD mappings.

A PUD THP looks like:

┌───┬───┬───┬───┬─────┬───┬───┬───┬───┬────────┬──────┐
│ H │ T │ T │ T │ ... │ T │ T │ T │ T │  ...   │  T   │
│ 0 │ 1 │ 2 │ 3 │     │512│513│514│515│        │262143│
└───┴───┴───┴───┴─────┴───┴───┴───┴───┴────────┴──────┘

PMDPageInPUD pages in a PUD THP (only the first two PMDPageInPUD pages are
shown below). Note that PMDPageInPUD pages are identified by their relative
position to the head page of the PUD THP and are still tail pages except the
first one, so H_0, T_512, T_1024, ... T_512x511 are all PMDPageInPUD pages:

┌────────────┬──────────────┬────────────┬──────────────┬───────────────────┐
│PMDPageInPUD│     ...      │PMDPageInPUD│     ...      │   the remaining   │
│    page    │ 511 subpages │    page    │ 511 subpages │ 510x512 subpages  │
└────────────┴──────────────┴────────────┴──────────────┴───────────────────┘

Mapcount positions:

* For each subpage, its PTE mapcount is _mapcount, the same as PMD THP.
* For PUD THP, its PUD-mapping uses compound_mapcount at T_1, the same as
  PMD THP.
* For PMD-mapped PUD THP, its PMD-mapping uses compound_mapcount at T_3,
  T_515, ..., T_512x511+3. It is called sub_compound_mapcount.

PUDDoubleMap and DoubleMap in PUD THP:

* PUDDoubleMap is stored at the page flag of T_2 (third subpage), reusing
  the DoubleMap's position.
* DoubleMap is stored at the page flags of T_1 (second subpage), T_513, ...,
  T_512x511+1.

[0] https://lore.kernel.org/linux-mm/20200907072014.GD30144@dhcp22.suse.cz/
[1] https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/
[2] https://lore.kernel.org/linux-mm/20200903165051.GN24045@ziepe.ca/

Changelog from RFC v1
====
1. Add Jason's PUD entry READ_ONCE patch and my PMD entry READ_ONCE patch to
   get consistent page table entry reading in lockless page table walks.
2. Use a singly linked list for page table page deposit instead of the
   pagechain data structure from RFC v1.
3. Address Kirill's comments.
4. Remove PUD page allocation via alloc_contig_pages(), using cma_alloc
   only.
5. Add madvise flag MADV_HUGEPAGE_1GB to explicitly enable PUD THP on
   specific VMAs instead of reusing MADV_HUGEPAGE. A new vm_flags
   VM_HUGEPAGE_PUD is added to achieve this.
6. Break large patches in v1 into small ones for easy review.

Jason Gunthorpe (1):
  mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked

Zi Yan (29):
  mm: pagewalk: use READ_ONCE when reading the PMD entry unlocked
  mm: thp: use single linked list for THP page table page deposit.
  mm: add new helper functions to allocate one PMD page with 512 PTE
    pages.
  mm: thp: add page table deposit/withdraw functions for PUD THP.
  mm: change thp_order and thp_nr as we will have not just PMD THPs.
  mm: thp: add anonymous PUD THP page fault support without enabling it.
  mm: thp: add PUD THP support for copy_huge_pud.
  mm: thp: add PUD THP support to zap_huge_pud.
  fs: proc: add PUD THP kpageflag.
  mm: thp: handling PUD THP reference bit.
  mm: rmap: add mappped/unmapped page order to anonymous page rmap
    functions.
  mm: rmap: add map_order to page_remove_anon_compound_rmap.
  mm: thp: add PUD THP split_huge_pud_page() function.
  mm: thp: add PUD THP to deferred split list when PUD mapping is gone.
  mm: debug: adapt dump_page to PUD THP.
  mm: thp: PUD THP COW splits PUD page and falls back to PMD page.
  mm: thp: PUD THP follow_p*d_page() support.
  mm: stats: make smap stats understand PUD THPs.
  mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  mm: thp: PUD THP support in try_to_unmap().
  mm: thp: split PUD THPs at page reclaim.
  mm: support PUD THP pagemap support.
  mm: madvise: add page size options to MADV_HUGEPAGE and
    MADV_NOHUGEPAGE.
  mm: vma: add VM_HUGEPAGE_PUD to vm_flags at bit 37.
  mm: thp: add a global knob to enable/disable PUD THPs.
  mm: thp: make PUD THP size public.
  hugetlb: cma: move cma reserve function to cma.c.
  mm: thp: use cma reservation for pud thp allocation.
  mm: thp: enable anonymous PUD THP at page fault path.

 .../admin-guide/kernel-parameters.txt      |   2 +-
 Documentation/admin-guide/mm/transhuge.rst |   1 +
 arch/arm64/mm/hugetlbpage.c                |   2 +-
 arch/powerpc/mm/hugetlbpage.c              |   2 +-
 arch/x86/include/asm/pgalloc.h             |  69 ++
 arch/x86/include/asm/pgtable.h             |  26 +
 arch/x86/kernel/setup.c                    |   8 +-
 arch/x86/mm/pgtable.c                      |  38 +
 drivers/base/node.c                        |   3 +
 fs/proc/meminfo.c                          |   2 +
 fs/proc/page.c                             |   2 +
 fs/proc/task_mmu.c                         | 200 +++-
 include/linux/cma.h                        |  18 +
 include/linux/huge_mm.h                    |  84 +-
 include/linux/hugetlb.h                    |  12 -
 include/linux/memcontrol.h                 |   5 +
 include/linux/mm.h                         |  42 +-
 include/linux/mm_types.h                   |  11 +-
 include/linux/mmu_notifier.h               |  13 +
 include/linux/mmzone.h                     |   1 +
 include/linux/page-flags.h                 |  48 +
 include/linux/pagewalk.h                   |   4 +-
 include/linux/pgtable.h                    |  34 +
 include/linux/rmap.h                       |  10 +-
 include/linux/swap.h                       |   2 +
 include/linux/vm_event_item.h              |   7 +
 include/uapi/asm-generic/mman-common.h     |  23 +
 include/uapi/linux/kernel-page-flags.h     |   1 +
 kernel/events/uprobes.c                    |   4 +-
 kernel/fork.c                              |  10 +-
 mm/cma.c                                   | 119 +++
 mm/debug.c                                 |   6 +-
 mm/gup.c                                   |  60 +-
 mm/hmm.c                                   |  16 +-
 mm/huge_memory.c                           | 899 +++++++++++++++++-
 mm/hugetlb.c                               | 117 +--
 mm/khugepaged.c                            |  16 +-
 mm/ksm.c                                   |   4 +-
 mm/madvise.c                               |  76 +-
 mm/mapping_dirty_helpers.c                 |   6 +-
 mm/memcontrol.c                            |  43 +-
 mm/memory.c                                |  28 +-
 mm/mempolicy.c                             |  29 +-
 mm/migrate.c                               |  12 +-
 mm/mincore.c                               |  10 +-
 mm/page_alloc.c                            |  53 +-
 mm/page_vma_mapped.c                       | 171 +++-
 mm/pagewalk.c                              |  47 +-
 mm/pgtable-generic.c                       |  49 +-
 mm/ptdump.c                                |   3 +-
 mm/rmap.c                                  | 300 ++++--
 mm/swap.c                                  |  30 +
 mm/swap_slots.c                            |   2 +
 mm/swapfile.c                              |  11 +-
 mm/userfaultfd.c                           |   2 +-
 mm/util.c                                  |  22 +-
 mm/vmscan.c                                |  33 +-
 mm/vmstat.c                                |   8 +
 58 files changed, 2396 insertions(+), 460 deletions(-)

-- 
2.28.0
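
Illustration of the subpage layout from the design section: the metadata positions in a PUD THP are plain index arithmetic relative to the head page. The userspace sketch below only restates the offsets given in the cover letter (512*N, 512*N + 1, 512*N + 3, T_1, T_2). All helper and macro names are hypothetical and are not taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* x86_64 with 4KB base pages: 512 PTEs per PMD, 512 PMDs per PUD. */
#define SUBPAGES_PER_PMD 512UL
#define PMDS_PER_PUD     512UL
#define PUD_THP_SUBPAGES (SUBPAGES_PER_PMD * PMDS_PER_PUD) /* 262144 */

/* Index of the N-th PMDPageInPUD page (H_0, T_512, T_1024, ...). */
static inline size_t pmd_page_in_pud_idx(size_t n)
{
	assert(n < PMDS_PER_PUD);
	return SUBPAGES_PER_PMD * n;
}

/* Subpage holding sub_compound_mapcount for the N-th PMD-sized chunk. */
static inline size_t sub_compound_mapcount_idx(size_t n)
{
	return pmd_page_in_pud_idx(n) + 3;
}

/* Subpage whose page flags hold DoubleMap for the N-th PMD-sized chunk. */
static inline size_t doublemap_idx(size_t n)
{
	return pmd_page_in_pud_idx(n) + 1;
}

/* PUD-level compound_mapcount lives at T_1, PUDDoubleMap at T_2. */
static inline size_t pud_compound_mapcount_idx(void) { return 1; }
static inline size_t pud_doublemap_idx(void) { return 2; }

/* A subpage is a PMDPageInPUD page iff its index is a multiple of 512. */
static inline bool is_pmd_page_in_pud(size_t idx)
{
	assert(idx < PUD_THP_SUBPAGES);
	return idx % SUBPAGES_PER_PMD == 0;
}
```

Under these assumptions, the last PMD-sized chunk (N = 511) starts at subpage 261632 and keeps its sub_compound_mapcount at subpage 261635, which is the T_512x511+3 position named above.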
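
Illustration of the hierarchical deposit described for patch 3: a minimal userspace sketch of a singly linked deposit list. Only the field names deposit_head and deposit_node come from the cover letter. The struct layout, the function names, and the LIFO withdraw order are assumptions made for illustration and may differ from the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* One forward pointer: the list is singly linked. */
struct deposit_link {
	struct deposit_link *next;
};

/* Hypothetical stand-in for a page table page. */
struct pt_page {
	struct deposit_link deposit_head; /* pages deposited to this page */
	struct deposit_link deposit_node; /* this page's link in its parent's list */
};

/* Recover the page from its deposit_node, like container_of(). */
#define pt_page_of(link) \
	((struct pt_page *)((char *)(link) - offsetof(struct pt_page, deposit_node)))

/* Push child onto the front of parent's deposit list. */
static inline void pt_deposit(struct pt_page *parent, struct pt_page *child)
{
	child->deposit_node.next = parent->deposit_head.next;
	parent->deposit_head.next = &child->deposit_node;
}

/* Pop the most recently deposited page, or NULL if none is left. */
static inline struct pt_page *pt_withdraw(struct pt_page *parent)
{
	struct deposit_link *first = parent->deposit_head.next;

	if (!first)
		return NULL;
	parent->deposit_head.next = first->next;
	first->next = NULL;
	return pt_page_of(first);
}
```

Because each page carries both a head and a node, a PMD page that already holds its 512 PTE pages can itself be deposited to a PUD-level owner through its own deposit_node, which is the hierarchy the patch description refers to.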
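
Illustration of how userspace might opt a region in via the madvise flag from patches 24 and 25. This is a sketch under stated assumptions: MADV_HUGEPAGE_1GB is only defined by uapi headers from a kernel tree carrying this patchset, so the code uses it only when the macro exists and otherwise falls back to plain MADV_HUGEPAGE (which requests regular THP, not PUD THP). The helper names are hypothetical. The mapping is aligned to 1GB because a PUD entry on x86_64 maps a 1GB-aligned range.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define SZ_1G (1UL << 30)

/* Round addr up to the next 1GB boundary. */
static inline uintptr_t align_up_1g(uintptr_t addr)
{
	return (addr + SZ_1G - 1) & ~(uintptr_t)(SZ_1G - 1);
}

/*
 * Map an anonymous region containing a 1GB-aligned, 1GB-long window and
 * advise the kernel to back that window with a huge page. Returns the
 * aligned window, or NULL if mmap() fails. The unaligned slack is left
 * mapped for brevity.
 */
static inline void *map_1g_thp_candidate(void)
{
	/* Over-map by 1GB so an aligned 1GB window always fits. */
	void *raw = mmap(NULL, 2 * SZ_1G, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	void *win;

	if (raw == MAP_FAILED)
		return NULL;
	win = (void *)align_up_1g((uintptr_t)raw);

#ifdef MADV_HUGEPAGE_1GB
	/* Only available with headers from the patched kernel. */
	if (madvise(win, SZ_1G, MADV_HUGEPAGE_1GB))
		perror("madvise(MADV_HUGEPAGE_1GB)");
#else
	/* Fallback: regular THP hint, not a PUD THP request. */
	if (madvise(win, SZ_1G, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");
#endif
	return win;
}
```

Per the cover letter, enable_pud_thp defaults to never and PUD THPs come only from boot-time reserved CMA regions, so on a patched kernel the hint would presumably have no effect until both are configured.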