Date: Wed, 12 Aug 2020 07:19:32 +0800
From: Wei Yang <richard.weiyang@linux.alibaba.com>
To: Mike Kravetz
Cc: Michal Hocko, Baoquan He, Wei Yang, akpm@linux-foundation.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 10/10] mm/hugetlb: not necessary to abuse temporary page
 to workaround the nasty free_huge_page
Message-ID: <20200811231932.GA33666@L-31X9LVDL-1304.local>
Reply-To: Wei Yang
References: <20200807091251.12129-1-richard.weiyang@linux.alibaba.com>
 <20200807091251.12129-11-richard.weiyang@linux.alibaba.com>
 <20200810021737.GV14854@MiWiFi-R3L-srv>
 <129cc03e-c6d5-24f8-2f3c-f5a3cc821e76@oracle.com>
 <20200811015148.GA10792@MiWiFi-R3L-srv>
 <20200811065406.GC4793@dhcp22.suse.cz>

On Tue, Aug 11, 2020 at 02:43:28PM -0700, Mike Kravetz wrote:
>On 8/10/20 11:54 PM, Michal Hocko wrote:
>>
>> I have managed to forget all the juicy details since I have made that
>> change. All that remains is that the surplus pages accounting was quite
>> tricky and back then I didn't figure out a simpler method that would
>> achieve a consistent look at those counters. As mentioned above I
>> suspect this could lead to premature allocation failures while the
>> migration is ongoing.
>
>It is likely lost in the e-mail thread, but the suggested change was to
>alloc_surplus_huge_page().  The code which allocates the migration target
>(alloc_migrate_huge_page) will not be changed.  So, this should not be
>an issue.
>
>> Sure, quite unlikely to happen, and the race window is likely very
>> small. Maybe this is even acceptable, but I would strongly recommend
>> documenting all this thinking in the changelog.
>
>I wrote down a description of what happens in the two different approaches,
>"temporary page" vs "surplus page".  It is at the very end of this e-mail.
>When looking at the details, I came up with what may be an even better
>approach.  Why not just call the low level routine to free the page instead
>of going through put_page/free_huge_page?  At the very least, it saves a
>lock roundtrip and there is no need to worry about the counters/accounting.
>
>Here is a patch to do that.  However, we are optimizing a return path for
>a race condition that we are unlikely to ever hit.  I 'tested' it by
>allocating an 'extra' page and freeing it via this method in
>alloc_surplus_huge_page.
>
>From 864c5f8ef4900c95ca3f6f2363a85f3cb25e793e Mon Sep 17 00:00:00 2001
>From: Mike Kravetz
>Date: Tue, 11 Aug 2020 12:45:41 -0700
>Subject: [PATCH] hugetlb: optimize race error return in
> alloc_surplus_huge_page
>
>The routine alloc_surplus_huge_page() could race with a pool size
>change.  If this happens, the allocated page may not be needed.  To
>free the page, the current code will 'Abuse temporary page to
>workaround the nasty free_huge_page codeflow'.  Instead, directly
>call the low level routine that free_huge_page uses.  This works
>out well because the page is new, we hold the only reference and
>already hold the hugetlb_lock.
>
>Signed-off-by: Mike Kravetz
>---
> mm/hugetlb.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
>diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>index 590111ea6975..ac89b91fba86 100644
>--- a/mm/hugetlb.c
>+++ b/mm/hugetlb.c
>@@ -1923,14 +1923,17 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> 	/*
> 	 * We could have raced with the pool size change.
> 	 * Double check that and simply deallocate the new page
>-	 * if we would end up overcommiting the surpluses. Abuse
>-	 * temporary page to workaround the nasty free_huge_page
>-	 * codeflow
>+	 * if we would end up overcommiting the surpluses.
> 	 */
> 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>-		SetPageHugeTemporary(page);
>+		/*
>+		 * Since this page is new, we hold the only reference, and
>+		 * we already hold the hugetlb_lock call the low level free
>+		 * page routine.  This saves at least a lock roundtrip.

The change looks good to me, while I may not understand the "lock
roundtrip". You mean we don't need to release the hugetlb_lock?
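
My reading of the "roundtrip": the old path does spin_unlock(&hugetlb_lock)
and then put_page(), and when that drops the last reference free_huge_page()
has to take hugetlb_lock again to adjust the pool counters; the patched path
frees the page under the lock we already hold and unlocks exactly once. A
minimal user-space sketch of just that locking pattern (pool_lock and
nr_huge_pages are stand-ins for hugetlb_lock and the hstate counters, not the
real kernel code):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static int nr_huge_pages;

/* Old path: the caller has already dropped pool_lock before "put_page()". */
static void free_via_put_page(void)
{
	pthread_mutex_lock(&pool_lock);		/* free path relocks ...       */
	nr_huge_pages--;			/* ... to update pool counters */
	pthread_mutex_unlock(&pool_lock);
}

/* New path: free while still holding pool_lock; no extra lock/unlock pair. */
static void free_directly_locked(void)
{
	nr_huge_pages--;			/* caller already holds pool_lock */
}

int main(void)
{
	/* Old flow: unlock, then the free path takes the lock again. */
	nr_huge_pages = 1;
	pthread_mutex_lock(&pool_lock);
	pthread_mutex_unlock(&pool_lock);	/* spin_unlock(&hugetlb_lock) */
	free_via_put_page();			/* the extra "lock roundtrip" */

	/* New flow: one critical section covers both the check and the free. */
	nr_huge_pages = 1;
	pthread_mutex_lock(&pool_lock);
	free_directly_locked();
	pthread_mutex_unlock(&pool_lock);

	printf("nr_huge_pages = %d\n", nr_huge_pages);
	return 0;
}

So the lock is still released either way; what is saved is the second
acquire/release pair that the deferred free would otherwise do.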
>+		 */
>+		(void)put_page_testzero(page); /* don't call destructor */
>+		update_and_free_page(h, page);
> 		spin_unlock(&hugetlb_lock);
>-		put_page(page);
> 		return NULL;
> 	} else {
> 		h->surplus_huge_pages++;
>-- 
>2.25.4
>
>
>Here is a description of the difference in the "Temporary Page" vs "Surplus
>Page" approach.
>
>Both only allocate a fresh huge page if surplus_huge_pages is less than
>nr_overcommit_huge_pages.  Of course, the lock protecting those counts
>must be dropped to perform the allocation.  After reacquiring the lock
>is where we have the proposed difference in behavior.
>
>temporary page behavior
>-----------------------
>	if surplus_huge_pages >= h->nr_overcommit_huge_pages
>		SetPageHugeTemporary(page)
>		spin_unlock(&hugetlb_lock);
>		put_page(page);
>
>At this time we know surplus_huge_pages is 'at least' nr_overcommit_huge_pages.
>As a result, any new allocation will fail.
>The only user visible result is that the number of huge pages will be one
>greater than that specified by the user and overcommit values.  This is only
>visible for the short time until the page is actually freed as a result of
>put_page().
>
>free_huge_page()
>	number of huge pages will be decremented
>
>surplus page behavior
>---------------------
>	surplus_huge_pages++
>	surplus_huge_pages_node[page_to_nid(page)]++
>	if surplus_huge_pages > nr_overcommit_huge_pages
>		spin_unlock(&hugetlb_lock);
>		put_page(page);
>
>At this time we know surplus_huge_pages is greater than
>nr_overcommit_huge_pages.  As a result, any new allocation will fail.
>The user visible result is an increase in surplus pages as well as in the
>number of huge pages.  In addition, surplus pages will exceed overcommit.
>This is only visible for the short time until the page is actually freed
>as a result of put_page().
>
>free_huge_page()
>	number of huge pages will be decremented
>	h->surplus_huge_pages--;
>	h->surplus_huge_pages_node[nid]--;

-- 
Wei Yang
Help you, Help me
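
To make the counter visibility described above easier to compare, here is a
toy user-space model (illustrative only: struct pool and the helpers stand in
for struct hstate, SetPageHugeTemporary() and the deferred free_huge_page()
call; they are not the kernel code). The temporary-page variant briefly shows
one extra huge page, while the surplus variant additionally pushes
surplus_huge_pages above nr_overcommit_huge_pages until the deferred free
undoes both counters.

#include <stdbool.h>
#include <stdio.h>

struct pool {
	int nr_huge_pages;
	int surplus_huge_pages;
	int nr_overcommit_huge_pages;
};

/* Temporary page approach: only the page count is raised for a short time. */
static void alloc_raced_temporary(struct pool *p, bool *is_temporary)
{
	p->nr_huge_pages++;		/* a fresh huge page was allocated */
	*is_temporary = true;		/* models SetPageHugeTemporary(page) */
}

static void free_temporary(struct pool *p, bool *is_temporary)
{
	p->nr_huge_pages--;		/* deferred free: page count only */
	*is_temporary = false;
}

/* Surplus page approach: surplus temporarily exceeds the overcommit limit. */
static void alloc_raced_surplus(struct pool *p)
{
	p->nr_huge_pages++;
	p->surplus_huge_pages++;	/* now > nr_overcommit_huge_pages */
}

static void free_surplus(struct pool *p)
{
	p->nr_huge_pages--;
	p->surplus_huge_pages--;	/* deferred free undoes surplus too */
}

int main(void)
{
	struct pool p = { .nr_huge_pages = 2, .surplus_huge_pages = 1,
			  .nr_overcommit_huge_pages = 1 };
	bool tmp;

	alloc_raced_temporary(&p, &tmp);
	printf("temporary: nr=%d surplus=%d overcommit=%d\n",
	       p.nr_huge_pages, p.surplus_huge_pages, p.nr_overcommit_huge_pages);
	free_temporary(&p, &tmp);

	alloc_raced_surplus(&p);
	printf("surplus:   nr=%d surplus=%d overcommit=%d\n",
	       p.nr_huge_pages, p.surplus_huge_pages, p.nr_overcommit_huge_pages);
	free_surplus(&p);
	return 0;
}

In both cases the window only lasts until put_page() drops the last
reference, which matches the "short time until the page is actually freed"
noted above.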