Date: Mon, 25 Jan 2021 11:56:02 +0100
From: Oscar Salvador
To: David Hildenbrand
Cc: akpm@linux-foundation.org, mhocko@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, vbabka@suse.cz, pasha.tatashin@soleen.com
Subject: Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range
Message-ID: <20210125105557.GA28363@linux>
References: <20201217130758.11565-1-osalvador@suse.de> <20201217130758.11565-3-osalvador@suse.de> <21079c2d-67d0-fc59-8d7f-0795b3f8a3e3@redhat.com> <20210125103951.GA27851@linux>
In-Reply-To: <20210125103951.GA27851@linux>

On Mon, Jan 25, 2021 at 11:39:55AM +0100, Oscar Salvador wrote:
> > Interesting, so we automatically support differing sizeof(struct
> > page). I guess it will be problematic in case of sizeof(struct page) !=
> > 64, because then, we might not have multiples of 2MB for the memmap of a
> > memory block.
> >
> > IIRC, it could happen that if we add two consecutive memory blocks, that
> > the second one might reuse parts of the vmemmap residing on the first
> > memory block. If you remove the first one, you might be in trouble.
> >
> > E.g., on x86-64
> > vmemmap_populate()->vmemmap_populate_hugepages()->vmemmap_alloc_block_buf():
> > - Populate a huge page
> >
> > vmemmap_free()->remove_pagetable()...->remove_pmd_table():
> > - memchr_inv() will leave the hugepage populated.
> >
> > Do we want to fence that off, maybe in mhp_supports_memmap_on_memory()?
> > Or do we somehow want to fix that? We should never populate partial huge
> > pages from an altmap ...
> >
> > But maybe I am missing something.
>
> No, you are not missing anything.
>
> I think that remove_pmd_table() should be fixed.
> vmemmap_populate_hugepages() allocates a PMD_SIZE chunk and marks the PMD as
> PAGE_KERNEL_LARGE, so when remove_pmd_table() sees that 1) the PMD
> is large and 2) the chunk is not aligned, the memset() should be writing
> PAGE_INUSE for a PMD chunk.
>
> I do not really see a reason why this should not happen.
> Actually, we will be leaving pagetables behind, as we never get to free
> the PMD chunk when sizeof(struct page) is not a multiple of 2MB.
>
> I plan to fix remove_pmd_table(), but for now let us not allow using this feature
> if the size of a struct page is not 64.
> Let us keep it simple for the time being, shall we?

My first intention was:

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b5a3fa4033d3..0c9756a2eb24 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1044,32 +1044,14 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 			continue;
 
 		if (pmd_large(*pmd)) {
-			if (IS_ALIGNED(addr, PMD_SIZE) &&
-			    IS_ALIGNED(next, PMD_SIZE)) {
-				if (!direct)
-					free_hugepage_table(pmd_page(*pmd),
-							    altmap);
-
-				spin_lock(&init_mm.page_table_lock);
-				pmd_clear(pmd);
-				spin_unlock(&init_mm.page_table_lock);
-				pages++;
-			} else {
-				/* If here, we are freeing vmemmap pages. */
-				memset((void *)addr, PAGE_INUSE, next - addr);
-
-				page_addr = page_address(pmd_page(*pmd));
-				if (!memchr_inv(page_addr, PAGE_INUSE,
-						PMD_SIZE)) {
-					free_hugepage_table(pmd_page(*pmd),
-							    altmap);
-
-					spin_lock(&init_mm.page_table_lock);
-					pmd_clear(pmd);
-					spin_unlock(&init_mm.page_table_lock);
-				}
-			}
+			if (!direct)
+				free_hugepage_table(pmd_page(*pmd),
+						    altmap);
+			spin_lock(&init_mm.page_table_lock);
+			pmd_clear(pmd);
+			spin_unlock(&init_mm.page_table_lock);
+			pages++;
 			continue;
 		}

I did not try it out yet and this might explode badly, but AFAICS, a PMD-size
chunk is always allocated even when the section does not span 2MB.

E.g., with sizeof(struct page) = 56:

 PAGES_PER_SECTION * 64 = 2097152
 PAGES_PER_SECTION * 56 = 1835008

Even in the latter case, vmemmap_populate_hugepages() will allocate a PMD-size chunk
to satisfy the unaligned range.
So, unless I am missing something, it should not be a problem to free
that 2MB in remove_pmd_table() when 1) the PMD is large and 2) the range
is not aligned.

-- 
Oscar Salvador
SUSE L3
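
[Editorial addendum: a minimal userspace sketch of the alignment arithmetic discussed above, assuming the usual x86-64 defaults of 4KB base pages and 128MB sections (so PAGES_PER_SECTION = 32768) and 2MB PMDs. The constants and the helper name are illustrative assumptions only; this is not kernel code and not part of the proposed fix.]

#include <stdio.h>
#include <stdbool.h>

/*
 * Illustrative constants, not taken from kernel headers: with 4KB base
 * pages and 128MB sections, PAGES_PER_SECTION = 128MB / 4KB = 32768,
 * and a PMD maps 2MB.
 */
#define PAGES_PER_SECTION	32768UL
#define PMD_SIZE		(2UL * 1024 * 1024)

/*
 * The memmap of a section fills whole PMDs only when its size is a
 * multiple of PMD_SIZE; otherwise the last PMD chunk is partially used.
 */
static bool section_memmap_is_pmd_aligned(unsigned long struct_page_size)
{
	return (PAGES_PER_SECTION * struct_page_size) % PMD_SIZE == 0;
}

int main(void)
{
	unsigned long sizes[] = { 64, 56 };

	for (unsigned int i = 0; i < 2; i++) {
		unsigned long sz = sizes[i];

		printf("sizeof(struct page) = %lu: memmap = %lu bytes, PMD aligned: %s\n",
		       sz, PAGES_PER_SECTION * sz,
		       section_memmap_is_pmd_aligned(sz) ? "yes" : "no");
	}
	return 0;
}

With 64 bytes per struct page the section memmap is exactly one PMD (2097152 bytes); with 56 bytes it is 1835008 bytes, so the last PMD chunk is only partially used, which is the case the quoted discussion is about.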