From: David Hildenbrand
Organization: Red Hat GmbH
To: Oscar Salvador, Andrew Morton
Cc: Michal Hocko, Vlastimil Babka <vbabka@suse.cz>, pasha.tatashin@soleen.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, Anshuman Khandual
Subject: Re: [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range
Date: Thu, 25 Feb 2021 19:58:01 +0100
Message-ID: <60afb5ca-230e-265f-9579-dac66a152c33@redhat.com>
In-Reply-To: <20210209133854.17399-2-osalvador@suse.de>
References: <20210209133854.17399-1-osalvador@suse.de> <20210209133854.17399-2-osalvador@suse.de>

On 09.02.21 14:38, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section.
> Currently, alloc_pages_node() is used
> for those allocations.
>
> This has some disadvantages:
> a) existing memory is consumed for that purpose
>    (eg: ~2MB per 128MB memory section on x86_64)
> b) if the whole node is movable then we have off-node struct pages
>    which have performance drawbacks.
> c) It might be that there are no PMD_ALIGNED chunks, so the memmap array
>    gets populated with base pages.
>
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
>
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
>
> Implementation-wise, we will reuse the vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on the memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This comes in handy as in {online,offline}_pages, all the isolation and
> migration is being done on the (buddy_start_pfn, end_pfn] range,
> with buddy_start_pfn = start_pfn + nr_vmemmap_pages.
>
> In this way, we have:
>
> (start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> (buddy_start_pfn, end_pfn]       = Initialized and sent to buddy

nit: shouldn't it be

[start_pfn, buddy_start_pfn - 1]
[buddy_start_pfn, end_pfn - 1]

or

[start_pfn, buddy_start_pfn)
[buddy_start_pfn, end_pfn)

(I remember that "[" means inclusive and "(" means exclusive, but I might be
wrong :) )

I actually prefer the first variant.

>
> Hot-remove:
>
> We need to be careful when removing memory, as adding and
> removing memory needs to be done with the same granularity.
> To check that this assumption is not violated, we check the
> memory range we want to remove and if a) any memory block has
> vmemmap pages and b) the range spans more than a single memory
> block, we scream out loud and refuse to proceed.
>
> If all is good and the range was using memmap on memory (aka vmemmap pages),
> we construct an altmap structure so free_hugepage_table does the right
> thing and calls vmem_altmap_free instead of free_pagetable.
>
> Signed-off-by: Oscar Salvador

[...]

>  /*
>   * online_page_callback contains pointer to current page onlining function.
>   * Initially it is generic_online_page(). If it is required it could be
> @@ -638,10 +640,20 @@ void generic_online_page(struct page *page, unsigned int order)
>  }
>  EXPORT_SYMBOL_GPL(generic_online_page);
>
> -static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> +static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> +			       unsigned long buddy_start_pfn)
>  {
>  	const unsigned long end_pfn = start_pfn + nr_pages;
> -	unsigned long pfn;
> +	unsigned long pfn = buddy_start_pfn;
> +
> +	/*
> +	 * When using memmap_on_memory, the range might be unaligned as the
> +	 * first pfns are used for vmemmap pages. Align it in case we need to.
> +	 */
> +	if (pfn & ((1 << (MAX_ORDER - 1)) - 1)) {

if (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES))

> +		(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> +		pfn += 1 << pageblock_order;

pfn += pageblock_nr_pages;

Can you add a comment why we can be sure that we are off by a single
pageblock? What about s390x, where MAX_ORDER_NR_PAGES == 4 * pageblock_nr_pages?

Would it make things simpler to just do a

while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
	(*online_page_callback)(pfn_to_page(pfn), 0);
	pfn++;
}

[...]
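To spell out the idea - a rough, untested sketch only, assuming the tail of
the function (the MAX_ORDER - 1 sized loop and the online_mem_sections() call)
stays as it is in the current tree:

static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
			       unsigned long buddy_start_pfn)
{
	const unsigned long end_pfn = start_pfn + nr_pages;
	unsigned long pfn = buddy_start_pfn;

	/*
	 * With memmap_on_memory, the first pfns back the vmemmap and are not
	 * handed to the buddy. Online the unaligned head page by page until
	 * we hit a MAX_ORDER - 1 aligned boundary, without assuming anything
	 * about how pageblock_nr_pages relates to MAX_ORDER_NR_PAGES.
	 */
	while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES))
		(*online_page_callback)(pfn_to_page(pfn++), 0);

	/* Online the remaining, aligned range in MAX_ORDER - 1 chunks. */
	for (; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);

	/* Mark all involved sections as online. */
	online_mem_sections(start_pfn, end_pfn);
}

That way there is no hidden assumption about being off by exactly one
pageblock.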
>
>  /*
>   * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
> @@ -1064,6 +1085,12 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>  	return device_online(&mem->dev);
>  }
>
> +bool mhp_supports_memmap_on_memory(unsigned long size)
> +{
> +	return memmap_on_memory_enabled &&
> +	       size == memory_block_size_bytes();

Regarding my other comments in reply to the other patches, I'd move all the
magic you have when trying to enable the feature right here.

>
> -int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> +int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> +			unsigned long nr_vmemmap_pages)
>  {
>  	const unsigned long end_pfn = start_pfn + nr_pages;
> -	unsigned long pfn, system_ram_pages = 0;
> +	unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
>  	unsigned long flags;
>  	struct zone *zone;
>  	struct memory_notify arg;
> @@ -1578,6 +1620,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	    !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
>  		return -EINVAL;
>
> +	buddy_start_pfn = start_pfn + nr_vmemmap_pages;
> +	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
> +
>  	mem_hotplug_begin();
>
>  	/*
> @@ -1613,7 +1658,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	zone_pcp_disable(zone);
>
>  	/* set above range as isolated */
> -	ret = start_isolate_page_range(start_pfn, end_pfn,
> +	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
>  				       MIGRATE_MOVABLE,
>  				       MEMORY_OFFLINE | REPORT_FAILURE);

Did you take care to properly adjust undo_isolate_page_range() as well? I
can't spot it. (See the sketch below the quoted hunk for what I mean.)

>  	if (ret) {
> @@ -1633,7 +1678,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	}
>
>  	do {
> -		pfn = start_pfn;
> +		pfn = buddy_start_pfn;
>  		do {
>  			if (signal_pending(current)) {
>  				ret = -EINTR;
> @@ -1664,18 +1709,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  		 * offlining actually in order to make hugetlbfs's object
>  		 * counting consistent.
>  		 */
> -		ret = dissolve_free_huge_pages(start_pfn, end_pfn);
> +		ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
>  		if (ret) {
>  			reason = "failure to dissolve huge pages";
>  			goto failed_removal_isolated;
>  		}
>
> -		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
> +		ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
>
>  	} while (ret);
>
>  	/* Mark all sections offline and remove free pages from the buddy. */
> -	__offline_isolated_pages(start_pfn, end_pfn);
> +	__offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);
>  	pr_debug("Offlined Pages %ld\n", nr_pages);
>
>  	/*
> @@ -1684,13 +1729,13 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	 * of isolated pageblocks, memory onlining will properly revert this.
>  	 */
>  	spin_lock_irqsave(&zone->lock, flags);
> -	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
> +	zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
>  	spin_unlock_irqrestore(&zone->lock, flags);
>
>  	zone_pcp_enable(zone);
>
>  	/* removal success */
> -	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
> +	adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
>  	zone->present_pages -= nr_pages;
>
>  	pgdat_resize_lock(zone->zone_pgdat, &flags);
> @@ -1750,6 +1795,14 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
>  	return 0;
>  }
>

[...]
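To make the undo_isolate_page_range() point concrete - a rough sketch only,
assuming the error path still looks like it does in the current tree: the
failed_removal_isolated path would have to undo exactly the range that was
isolated, i.e. something like

failed_removal_isolated:
	/*
	 * Must match the range passed to start_isolate_page_range() above;
	 * the vmemmap pages at the beginning of the block were never isolated.
	 */
	undo_isolate_page_range(buddy_start_pfn, end_pfn, MIGRATE_MOVABLE);
	memory_notify(MEM_CANCEL_OFFLINE, &arg);

The same reasoning applies to any other place that still undoes the isolation
starting at start_pfn.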
> +static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
> +{
> +	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> +
> +	*nr_vmemmap_pages += mem->nr_vmemmap_pages;
> +	return mem->nr_vmemmap_pages;
> +}
> +

I think you can do this more easily: all you want to know is whether there is
any block that has nr_vmemmap_pages set - and return that value.

static int first_set_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
{
	/* If not set, continue with the next block. */
	return mem->nr_vmemmap_pages;
}

...

> +	walk_memory_blocks(start, size, &nr_vmemmap_pages,
> +			   get_nr_vmemmap_pages_cb);

...

nr_vmemmap_pages = walk_memory_blocks(start ...);

Looks quite promising, only a couple of things to fine-tune :) Thanks!

-- 
Thanks,

David / dhildenb