From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f197.google.com (mail-pf0-f197.google.com [209.85.192.197]) by kanga.kvack.org (Postfix) with ESMTP id 4CE476B0033 for ; Wed, 29 Nov 2017 02:17:43 -0500 (EST) Received: by mail-pf0-f197.google.com with SMTP id v25so1811190pfg.14 for ; Tue, 28 Nov 2017 23:17:43 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id d76si874533pfk.321.2017.11.28.23.17.41 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 28 Nov 2017 23:17:41 -0800 (PST) Date: Wed, 29 Nov 2017 08:17:37 +0100 From: Michal Hocko Subject: Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration Message-ID: <20171129071737.vjg2sckpkzelifr2@dhcp22.suse.cz> References: <20171128101907.jtjthykeuefxu7gl@dhcp22.suse.cz> <20171128141211.11117-1-mhocko@kernel.org> <20171128141211.11117-3-mhocko@kernel.org> <29679b8f-53d9-b928-7721-9450dde38104@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <29679b8f-53d9-b928-7721-9450dde38104@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: linux-mm@kvack.org, Naoya Horiguchi , LKML On Tue 28-11-17 17:39:50, Mike Kravetz wrote: > On 11/28/2017 06:12 AM, Michal Hocko wrote: > > From: Michal Hocko > > > > hugepage migration relies on __alloc_buddy_huge_page to get a new page. > > This has 2 main disadvantages. > > 1) it doesn't allow to migrate any huge page if the pool is used > > completely which is not an exceptional case as the pool is static and > > unused memory is just wasted. > > 2) it leads to a weird semantic when migration between two numa nodes > > might increase the pool size of the destination NUMA node while the page > > is in use. The issue is caused by per NUMA node surplus pages tracking > > (see free_huge_page). > > > > Address both issues by changing the way how we allocate and account > > pages allocated for migration. Those should temporal by definition. > > So we mark them that way (we will abuse page flags in the 3rd page) > > and update free_huge_page to free such pages to the page allocator. > > Page migration path then just transfers the temporal status from the > > new page to the old one which will be freed on the last reference. > > In general, I think this will work. Some questions below. > > > The global surplus count will never change during this path but we still > > have to be careful when freeing a page from a node with surplus pages > > on the node. > > Not sure about the "freeing page from a node with surplus pages" comment. > If allocating PageHugeTemporary pages does not adjust surplus counts, then > there should be no concern at the time of freeing. > > Could this comment be a hold over from a previous implementation attempt? > Not really. You have to realize that the original page could be surplus on its node. More on that below. [...] > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 8189c92fac82..037bf0f89463 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page) > > if (restore_reserve) > > h->resv_huge_pages++; > > > > - if (h->surplus_huge_pages_node[nid]) { > > + if (PageHugeTemporary(page)) { > > + list_del(&page->lru); > > + ClearPageHugeTemporary(page); > > + update_and_free_page(h, page); > > + if (h->surplus_huge_pages_node[nid]) > > + h->surplus_huge_pages_node[nid]--; > > I think this is not correct. Should the lines dealing with per-node > surplus counts even be here? If the lines above are correct, then it > implies that the sum of per node surplus counts could exceed (or get out > of sync with) the global surplus count. You are right, I guess. This per-node accounting makes the whole thing real pain. I am worried that we will free next page from the same node and reduce the overal pool size. I will think about it some more. > > + } else if (h->surplus_huge_pages_node[nid]) { > > /* remove the page from active list */ > > list_del(&page->lru); > > update_and_free_page(h, page); > > @@ -1531,7 +1537,11 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn) > > return rc; > > } > > > > -static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask, > > +/* > > + * Allocates a fresh surplus page from the page allocator. Temporary > > + * requests (e.g. page migration) can pass enforce_overcommit == false > > 'enforce_overcommit == false' perhaps part of an earlier implementation > attempt? yeah. [...] > > diff --git a/mm/migrate.c b/mm/migrate.c > > index 4d0be47a322a..b3345f8174a9 100644 > > --- a/mm/migrate.c > > +++ b/mm/migrate.c > > @@ -1326,6 +1326,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, > > hugetlb_cgroup_migrate(hpage, new_hpage); > > put_new_page = NULL; > > set_page_owner_migrate_reason(new_hpage, reason); > > + > > + /* > > + * transfer temporary state of the new huge page. This is > > + * reverse to other transitions because the newpage is going to > > + * be final while the old one will be freed so it takes over > > + * the temporary status. > > + * No need for any locking here because destructor cannot race > > + * with us. > > + */ > > + if (PageHugeTemporary(new_hpage)) { > > + SetPageHugeTemporary(hpage); > > + ClearPageHugeTemporary(new_hpage); > > + } > > } > > > > unlock_page(hpage); > > > > I'm still trying to wrap my head around all the different scenarios. > In general, this new code only 'kicks in' if the there is not a free > pre-allocated huge page for migration. Right? yes > So, if there are free huge pages they are 'consumed' during migration > and the number of available pre-allocated huge pages is reduced? Or, > is that not exactly how it works? Or does it depend in the purpose > of the migration? Well, if we have pre-allocated pages then we just consume them and they will not get Temporary status so the additional code doesn't kick in. > The only reason I ask is because this new method of allocating a surplus > page (if successful) results in no decrease of available huge pages. > Perhaps all migrations should attempt to allocate surplus pages and not > impact the pre-allocated number of available huge pages. That could reduce the chances of the migration success because allocating a fresh huge page can fail. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org