References: <20201113105952.11638-1-songmuchun@bytedance.com>
 <349168819c1249d4bceea26597760b0a@hisilicon.com>
 <714ae7d701d446259ab269f14a030fe9@hisilicon.com>
In-Reply-To: <714ae7d701d446259ab269f14a030fe9@hisilicon.com>
From: Muchun Song
Date: Wed, 18 Nov 2020 00:29:07 +0800
Subject: Re: [External] RE: [PATCH v4 00/21] Free some vmemmap pages of hugetlb page
To: "Song Bao Hua (Barry Song)"
Cc: corbet@lwn.net, mike.kravetz@oracle.com, tglx@linutronix.de,
    mingo@redhat.com, bp@alien8.de, x86@kernel.org, hpa@zytor.com,
    dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org,
    viro@zeniv.linux.org.uk, akpm@linux-foundation.org, paulmck@kernel.org,
    mchehab+huawei@kernel.org, pawan.kumar.gupta@linux.intel.com,
    rdunlap@infradead.org, oneukum@suse.com, anshuman.khandual@arm.com,
    jroedel@suse.de, almasrymina@google.com, rientjes@google.com,
    willy@infradead.org, osalvador@suse.de, mhocko@suse.com,
    duanxiongchun@bytedance.com, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org

On Tue, Nov 17, 2020 at 7:08 PM Song Bao Hua (Barry Song) wrote:
>
> > -----Original Message-----
> > From: Muchun Song [mailto:songmuchun@bytedance.com]
> > Sent: Tuesday, November 17, 2020 11:50 PM
> > To: Song Bao Hua (Barry Song)
> > Cc: corbet@lwn.net; mike.kravetz@oracle.com; tglx@linutronix.de;
> > mingo@redhat.com; bp@alien8.de; x86@kernel.org; hpa@zytor.com;
> > dave.hansen@linux.intel.com; luto@kernel.org; peterz@infradead.org;
> > viro@zeniv.linux.org.uk; akpm@linux-foundation.org; paulmck@kernel.org;
> > mchehab+huawei@kernel.org; pawan.kumar.gupta@linux.intel.com;
> > rdunlap@infradead.org; oneukum@suse.com; anshuman.khandual@arm.com;
> > jroedel@suse.de; almasrymina@google.com; rientjes@google.com;
> > willy@infradead.org; osalvador@suse.de; mhocko@suse.com;
> > duanxiongchun@bytedance.com; linux-doc@vger.kernel.org;
> > linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > linux-fsdevel@vger.kernel.org
> > Subject: Re: [External] RE: [PATCH v4 00/21] Free some vmemmap pages of
> > hugetlb page
> >
> > On Tue, Nov 17, 2020 at 6:16 PM Song Bao Hua (Barry Song)
> > wrote:
> > >
> > > > -----Original Message-----
> > > > From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
> > > > Behalf Of Muchun Song
> > > > Sent: Saturday, November 14, 2020 12:00 AM
> > > > To: corbet@lwn.net; mike.kravetz@oracle.com; tglx@linutronix.de;
> > > > mingo@redhat.com; bp@alien8.de; x86@kernel.org; hpa@zytor.com;
> > > > dave.hansen@linux.intel.com; luto@kernel.org; peterz@infradead.org;
> > > > viro@zeniv.linux.org.uk; akpm@linux-foundation.org; paulmck@kernel.org;
> > > > mchehab+huawei@kernel.org; pawan.kumar.gupta@linux.intel.com;
> > > > rdunlap@infradead.org; oneukum@suse.com; anshuman.khandual@arm.com;
> > > > jroedel@suse.de; almasrymina@google.com; rientjes@google.com;
> > > > willy@infradead.org; osalvador@suse.de; mhocko@suse.com
> > > > Cc: duanxiongchun@bytedance.com; linux-doc@vger.kernel.org;
> > > > linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > linux-fsdevel@vger.kernel.org; Muchun Song
> > > > Subject: [PATCH v4 00/21] Free some vmemmap pages of hugetlb page
> > > >
> > > > Hi all,
> > > >
> > > > This patch series will free some vmemmap pages(struct page structures)
> > > > associated with each hugetlbpage when preallocated to save memory.
> > > >
> > > > Nowadays we track the status of physical page frames using struct page
> > > > structures arranged in one or more arrays.
> > > > There is a one-to-one mapping between each physical page frame and
> > > > its corresponding struct page structure.
> > > >
> > > > The HugeTLB support is built on top of multiple page size support that
> > > > is provided by most modern architectures. For example, x86 CPUs normally
> > > > support 4K and 2M (1G if architecturally supported) page sizes. Every
> > > > HugeTLB page has more than one struct page structure. A 2M HugeTLB page
> > > > has 512 struct page structures and a 1G HugeTLB page has 262144 struct
> > > > page structures (4096 pages' worth). But the core of HugeTLB only uses
> > > > the first 4 struct page structures (the use of the first 4 comes from
> > > > HUGETLB_CGROUP_MIN_ORDER) to store metadata associated with each
> > > > HugeTLB page. For the rest of the struct page structures, usually only
> > > > the compound_head field is read, and it holds the same value in all of
> > > > them. If we can free some of that struct page memory to the buddy
> > > > system, we can save a lot of memory.
> > > >
> > > > When the system boots up, every 2M HugeTLB page has 512 struct page
> > > > structures, whose total size is 8 pages (sizeof(struct page) * 512 /
> > > > PAGE_SIZE).
> > > >
> > > > hugetlbpage                   struct pages(8 pages)          page frame(8 pages)
> > > > +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> > > > |           |                     |     0     | -------------> |     0     |
> > > > |           |                     |     1     | -------------> |     1     |
> > > > |           |                     |     2     | -------------> |     2     |
> > > > |           |                     |     3     | -------------> |     3     |
> > > > |           |                     |     4     | -------------> |     4     |
> > > > |     2M    |                     |     5     | -------------> |     5     |
> > > > |           |                     |     6     | -------------> |     6     |
> > > > |           |                     |     7     | -------------> |     7     |
> > > > |           |                     +-----------+                +-----------+
> > > > |           |
> > > > |           |
> > > > +-----------+
> > > >
> > > > When a hugetlbpage is preallocated, we can change the mapping from above
> > > > to below.
> > > >
> > > > hugetlbpage                   struct pages(8 pages)          page frame(8 pages)
> > > > +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> > > > |           |                     |     0     | -------------> |     0     |
> > > > |           |                     |     1     | -------------> |     1     |
> > > > |           |                     |     2     | -------------> +-----------+
> > > > |           |                     |     3     | -----------------^ ^ ^ ^ ^
> > > > |           |                     |     4     | -------------------+ | | |
> > > > |     2M    |                     |     5     | ---------------------+ | |
> > > > |           |                     |     6     | -----------------------+ |
> > > > |           |                     |     7     | -------------------------+
> > > > |           |                     +-----------+
> > > > |           |
> > > > |           |
> > > > +-----------+
> > > >
> > > > For tail pages, the value of compound_head is the same. So we can reuse
> > > > the first page of tail page structs. We map the virtual addresses of the
> > > > remaining 6 pages of tail page structs to the first tail page struct,
> > > > and then free these 6 pages. Therefore, we need to reserve at least 2
> > > > pages as vmemmap areas.
> > > >
> > > > When a hugetlbpage is freed to the buddy system, we should allocate six
> > > > pages for vmemmap pages and restore the previous mapping relationship.
> > > >
> > > > If we use the 1G hugetlbpage, we can save 4088 pages (there are 4096
> > > > pages for struct page structures, we reserve 2 pages for vmemmap and 8
> > > > pages for page tables. So we can save 4088 pages). This is a very
> > > > substantial gain. On our server, we run some SPDK/QEMU applications
> > > > which use 1024GB of hugetlbpage. With this feature enabled, we can save
> > > > ~16GB(1G hugepage)/~11GB(2MB hugepage)
> > >
> > > Hi Muchun,
> > >
> > > Do we really save 11GB for 2MB hugepage?
> > > How much do we save if we only get one 2MB hugetlb from one 128MB
> > > mem_section?
> > > It seems we need to get at least one page for the PTEs since we are
> > > splitting PMD of vmemmap into PTE?
> >
> > There are 524288(1024GB/2MB) 2MB HugeTLB pages.
> > We can save 6 pages for each 2MB HugeTLB page. So we can save 3145728
> > pages. But we need to split the PMD page table for every 128MB
> > mem_section, and every section needs one page as its PTE page table.
> > So we need 8192(1024GB/128MB) pages as PTE page tables. Finally, we can
> > save 3137536(3145728-8192) pages which is 11.97GB.
>
> The worst case I can see is that:
> if we get 100 hugetlb with 2MB size, but the 100 hugetlb comes from different
> mem_section, we won't save 11.97GB. we only save 5/8 * 16GB=10GB.
>
> Anyway, it seems 11GB is in the middle of 10GB and 11.97GB,
> so sounds sensible :-)
>
> ideally, we should be able to free PageTail if we change struct page in some way.
> Then we will save much more for 2MB hugetlb. but it seems it is not easy.

Now, for a 2MB HugeTLB page, we only free 6 vmemmap pages. But your words
woke me up. Maybe we really can free 7 vmemmap pages. In that case, 8 of
the 512 struct page structures would appear to have the PG_head flag set,
since all 8 vmemmap pages would then map to the page frame that holds the
head struct page. That could work if we adjust compound_head() slightly so
that it returns the real head struct page when its argument is a tail
struct page that nevertheless has the PG_head flag set. I will start an
investigation and a test.

Thanks.

>
> Thanks
> Barry

--
Yours,
Muchun
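
For anyone who wants to re-check the numbers in this thread, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope model, not kernel code. It assumes 4KB base pages and one PTE page for every 128MB mem_section whose vmemmap PMD gets split, as stated above. The function and variable names are made up for illustration, and the 7-pages-per-hugepage case is the idea still under investigation, not something the series does today.

```python
# Back-of-the-envelope model of the vmemmap savings discussed in this thread.
# All names here are hypothetical; nothing below is taken from the patches.

PAGE_SIZE = 4096           # bytes, 4KB base page (assumption)
GIB = 1 << 30


def net_saved_pages(nr_hugepages, freed_per_hugepage, nr_split_sections):
    """Vmemmap pages freed, minus one PTE page per split mem_section."""
    return nr_hugepages * freed_per_hugepage - nr_split_sections


def pages_to_gib(pages):
    """Convert a count of base pages to GiB."""
    return pages * PAGE_SIZE / GIB


if __name__ == "__main__":
    pool = 1024 * GIB                      # 1024GB of HugeTLB memory
    nr_2m = pool // (2 << 20)              # 524288 huge pages of 2MB
    nr_sections = pool // (128 << 20)      # 8192 mem_sections of 128MB

    # Current series: 6 vmemmap pages freed per 2MB huge page.
    saved6 = net_saved_pages(nr_2m, 6, nr_sections)
    print(saved6, round(pages_to_gib(saved6), 2))

    # Idea under investigation: 7 vmemmap pages freed per 2MB huge page.
    saved7 = net_saved_pages(nr_2m, 7, nr_sections)
    print(saved7, round(pages_to_gib(saved7), 2))

    # A single 2MB huge page alone in its mem_section: 6 freed, 1 PTE page paid.
    print(net_saved_pages(1, 6, 1))
```

The last call covers the case raised above of a lone 2MB huge page in a 128MB section: the PTE page is paid for by that one huge page, so the net gain drops from 6 pages to 5.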