From: Zi Yan
To: David Hildenbrand
Cc: linux-mm@kvack.org, Matthew Wilcox, Kirill A.
Shutemov" , Roman Gushchin , Andrew Morton , Yang Shi , Michal Hocko , John Hubbard , "Ralph Campbell" , David Nellans , "Jason Gunthorpe" , David Rientjes , "Vlastimil Babka" , Mike Kravetz , Song Liu Subject: Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64 Date: Thu, 25 Feb 2021 17:13:38 -0500 X-Mailer: MailMate (1.14r5757) Message-ID: <67B2C538-45DB-4678-A64D-295A9703EDE1@nvidia.com> In-Reply-To: References: <20210224223536.803765-1-zi.yan@sent.com> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="=_MailMate_82C2A834-15C8-4A25-BB70-A03ED9074269_="; micalg=pgp-sha512; protocol="application/pgp-signature" X-Originating-IP: [172.20.145.6] X-ClientProxiedBy: HQMAIL107.nvidia.com (172.20.187.13) To HQMAIL107.nvidia.com (172.20.187.13) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nvidia.com; s=n1; t=1614291226; bh=qup7LX3IK/d7+RaBSGy6ko1eiixWL0vP3zvcQVJwBx0=; h=From:To:CC:Subject:Date:X-Mailer:Message-ID:In-Reply-To: References:MIME-Version:Content-Type:X-Originating-IP: X-ClientProxiedBy; b=LHw5L7YCWLjyC+U9UY5ahdOdlt5VRVnRB8SUElxYLbgrwYxtiy7C8NrrHPpVXse87 Wkr5KWZE8TAaw3k3kQ2wOJWHfVqSvxYd6IkrLiPhL1GpYPVLRIZFLcRtbW4Z+QfqaG aQZ86+wL0+0Nt24i2O1Y3fDqFhTiTsN/Vq9SL2vgcJRBnUOFxRmzUREihE/NDw/cY5 jBVEFAOd+zveksGmF4cuNGZ2ZvgJoOLIKx1HT06XWDnidkt8I3dXehwGK4JhXpie7G kW8D4WMenXomV3QwNjjM9AuiCEJ8YIw3jPftrBHZAS3ECZFdxM1GbMlv9JgdU8ERLI PfvymTbMaDwfw== X-Stat-Signature: ngxtb7s88fgof1jixczx1bi5rzc8anfa X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 89559200038E Received-SPF: none (nvidia.com>: No applicable sender policy available) receiver=imf01; identity=mailfrom; envelope-from=""; helo=hqnvemgate24.nvidia.com; client-ip=216.228.121.143 X-HE-DKIM-Result: pass/pass X-HE-Tag: 1614291227-152902 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: --=_MailMate_82C2A834-15C8-4A25-BB70-A03ED9074269_= Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On 25 Feb 2021, at 6:02, David Hildenbrand wrote: > On 24.02.21 23:35, Zi Yan wrote: >> From: Zi Yan >> >> Hi all, >> >> I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-1= 8-18-29 >> and the code is available at >> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-0= 2-18-18-29 >> if you want to give it a try. The actual 49 patches are not sent out w= ith this >> cover letter. :) >> >> Instead of asking for code review, I would like to discuss on the conc= erns I got >> from previous RFCs. I think there are two major ones: >> >> 1. 1GB page allocation. Current implementation allocates 1GB pages fro= m CMA >> regions that are reserved at boot time like hugetlbfs. The concern= s on >> using CMA is that an educated guess is needed to avoid depleting k= ernel >> memory in case CMA regions are set too large. Recently David Rient= jes >> proposes to use process_madvise() for hugepage collapse, which is = an >> alternative [1] but might not work for 1GB pages, since there is n= o way of > > I see two core ideas of THP: > > 1) Transparent to the user: you get speedup without really caring *exce= pt* having to enable/disable the optimization sometimes manually (i.e., M= ADV_HUGEPAGE) - because in corner cases (e.g., userfaultfd), it's not co= mpletely transparent and might have performance impacts. mprotect(), mmap= (MAP_FIXED), mremap() work as expected. 
>
> 2) Transparent to other subsystems of the kernel: the page size of
> the mapping is in base pages - we can split any time on demand in
> case we cannot handle a THP. In addition, no special requirements:
> no CMA, no movability restrictions, no swappability restrictions, ...
> most stuff works transparently by splitting.
>
> Your current approach messes with 2). Your proposal here messes with 1).
>
> Any kind of explicit placement by the user can silently get reverted
> at any time. So process_madvise() would really only be useful in
> cases where a temporary split might get reverted later on by the OS
> automatically - like we have for 2MB THP right now.
>
> So process_madvise() is less likely to help if the system won't try
> collapsing automatically (more below).

>> _allocating_ a 1GB page into which to collapse pages. I proposed a
>> similar approach at LSF/MM 2019: generating physically contiguous
>> memory after pages are allocated [2], which is usable for 1GB THPs.
>> This approach does in-place huge page promotion and thus does not
>> require page allocation.
>
> I like the idea of forming a 1GB THP at a location where already
> consecutive pages allow for it. It can be applied generically - and
> both 1) and 2) keep working as expected. Any time there was a split,
> we can retry forming a THP later.
>
> However, I don't follow how this is actually feasible at scale. You
> could only ever collapse into a 1GB THP if you happen to have 1GB of
> consecutive 2MB THPs / 4KB pages already. Sounds to me like this
> happens when the stars align.

Both the process_madvise() approach and my proposal require page
migration to bring back THPs, since, like you said, having consecutive
pages ready is extremely rare. IIUC, the process_madvise() approach
reuses khugepaged code to collapse huge pages: first allocate a 2MB
THP, then copy data over, and finally free the old base pages. My
proposal would instead migrate pages within a virtual address range
(>1GB and 1GB-aligned) to make all physical pages contiguous, then
promote the resulting 1GB of consecutive pages to a 1GB THP. No new
page allocation is needed.

Both approaches would need user-space invocation, assuming either that
the application itself wants to get THPs for a specific region or that
a user-space daemon would do this for a group of applications, instead
of waiting for khugepaged to slowly (4096 pages every 10s by default)
scan and do huge page collapse (sketches of both interfaces below).
The user will pay the cost of getting THPs. This also means THPs are
not completely transparent to the user, but I think that should be
fine when users explicitly invoke these two methods to get THPs for
better performance.

The difference in my proposal is that it does not need a 1GB THP
allocation, so there are no special requirements like using CMA or
increasing MAX_ORDER in the buddy allocator to allow 1GB page
allocation. It makes creating THPs with orders > MAX_ORDER possible
without other intrusive changes.
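
To make the "application asks for THPs itself" path concrete, here is
a minimal user-space sketch using today's madvise(MADV_HUGEPAGE)
interface on a 1GB-aligned region. This is the existing 2MB-THP
opt-in, shown only to illustrate the shape of the interface; reusing
the same call for 1GB THP is an assumption, not an existing API:

/* Minimal sketch: opt a 1GB-aligned anonymous region in to THP via the
 * existing madvise(MADV_HUGEPAGE) interface. Today this enables 2MB
 * THP; using the same call for 1GB THP is hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define ONE_GB (1UL << 30)

int main(void)
{
	void *buf;
	int err;

	/* Get a 1GB-aligned, 1GB-sized allocation. */
	err = posix_memalign(&buf, ONE_GB, ONE_GB);
	if (err) {
		fprintf(stderr, "posix_memalign: %s\n", strerror(err));
		return 1;
	}

	/* Opt the range in to THP; khugepaged may later collapse the
	 * base pages in this range into huge pages. */
	if (madvise(buf, ONE_GB, MADV_HUGEPAGE)) {
		perror("madvise");
		return 1;
	}

	memset(buf, 0, ONE_GB);	/* populate the range */
	return 0;
}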
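
And for the user-space-daemon path, a hedged sketch of what driving
collapse via process_madvise() could look like. The syscall itself
exists (since 5.10), but it currently only accepts MADV_COLD and
MADV_PAGEOUT; the collapse advice value below is purely a placeholder:

/* Hedged sketch of a daemon-side process_madvise() call. The syscall
 * is real (Linux 5.10+), but MADV_COLLAPSE_HYPOTHETICAL is invented:
 * no collapse advice exists at the time of writing. */
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef SYS_process_madvise
#define SYS_process_madvise 440		/* x86_64 syscall number */
#endif

#define MADV_COLLAPSE_HYPOTHETICAL 25	/* placeholder advice value */

/* Ask the kernel to collapse [addr, addr + len) in the target process,
 * identified by a pidfd (e.g., obtained via pidfd_open(2)). */
static long collapse_range(int pidfd, void *addr, size_t len)
{
	struct iovec iov = {
		.iov_base = addr,
		.iov_len  = len,
	};

	return syscall(SYS_process_madvise, pidfd, &iov, 1,
		       MADV_COLLAPSE_HYPOTHETICAL, 0);
}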
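
For reference on the "4096 pages every 10s" figure: those numbers are
the defaults of khugepaged's sysfs knobs, which can be read back like
this (standard THP sysfs paths, assuming a kernel with THP enabled):

/* Print khugepaged's scan-rate knobs; their defaults (4096 pages,
 * 10000 ms) give the "4096 pages every 10s" figure quoted above. */
#include <stdio.h>

static void show(const char *path)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("%s = %s", path, buf);
	fclose(f);
}

int main(void)
{
	show("/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan");
	show("/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs");
	return 0;
}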

--
Best Regards,
Yan Zi