From: Mike Kravetz <mike.kravetz@oracle.com>
To: Michal Hocko <mhocko@kernel.org>, linux-mm@kvack.org
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Michal Hocko <mhocko@suse.com>
Subject: Re: [RFC PATCH 4/5] mm, hugetlb: get rid of surplus page accounting tricks
Date: Wed, 13 Dec 2017 16:45:55 -0800	[thread overview]
Message-ID: <8b07431a-547e-4330-4276-570ef861bb35@oracle.com> (raw)
In-Reply-To: <20171204140117.7191-5-mhocko@kernel.org>

On 12/04/2017 06:01 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> alloc_surplus_huge_page increases the pool size and the number of
> surplus pages opportunistically to prevent races with pool size
> changes. See d1c3fb1f8f29 ("hugetlb: introduce nr_overcommit_hugepages
> sysctl") for more details.
> 
> The resulting code is unnecessarily hairy, causes code duplication, and
> doesn't allow the allocation paths to be shared. Moreover, pool size
> changes tend to be rare, so optimizing for them is not really
> reasonable. Simplify the code and allow a fresh surplus page to be
> allocated as long as we are under the overcommit limit, then recheck
> the condition after the allocation and drop the new page if the
> situation has changed. This should provide a reasonable guarantee that
> abrupt allocation requests will not go way over the limit.
> 
> If we consider races with the pool shrinking and enlarging, then we
> should be reasonably safe as well. In the first case we are off by one
> in the worst case, and the second case should work OK because the page
> is not yet visible. We can waste CPU cycles on the allocation, but that
> should be acceptable for a relatively rare condition.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/hugetlb.c | 60 +++++++++++++++++++++---------------------------------------
>  1 file changed, 21 insertions(+), 39 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a1b8b2888ec9..0c7dc269b6c0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1538,62 +1538,44 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
>  static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>  		int nid, nodemask_t *nmask)
>  {
> -	struct page *page;
> -	unsigned int r_nid;
> +	struct page *page = NULL;
>  
>  	if (hstate_is_gigantic(h))
>  		return NULL;
>  
> -	/*
> -	 * Assume we will successfully allocate the surplus page to
> -	 * prevent racing processes from causing the surplus to exceed
> -	 * overcommit
> -	 *
> -	 * This however introduces a different race, where a process B
> -	 * tries to grow the static hugepage pool while alloc_pages() is
> -	 * called by process A. B will only examine the per-node
> -	 * counters in determining if surplus huge pages can be
> -	 * converted to normal huge pages in adjust_pool_surplus(). A
> -	 * won't be able to increment the per-node counter, until the
> -	 * lock is dropped by B, but B doesn't drop hugetlb_lock until
> -	 * no more huge pages can be converted from surplus to normal
> -	 * state (and doesn't try to convert again). Thus, we have a
> -	 * case where a surplus huge page exists, the pool is grown, and
> -	 * the surplus huge page still exists after, even though it
> -	 * should just have been converted to a normal huge page. This
> -	 * does not leak memory, though, as the hugepage will be freed
> -	 * once it is out of use. It also does not allow the counters to
> -	 * go out of whack in adjust_pool_surplus() as we don't modify
> -	 * the node values until we've gotten the hugepage and only the
> -	 * per-node value is checked there.
> -	 */
>  	spin_lock(&hugetlb_lock);
> -	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> -		spin_unlock(&hugetlb_lock);
> -		return NULL;
> -	} else {
> -		h->nr_huge_pages++;
> -		h->surplus_huge_pages++;
> -	}
> +	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
> +		goto out_unlock;
>  	spin_unlock(&hugetlb_lock);
>  
>  	page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
> +	if (!page)
> +		goto out_unlock;
>  
>  	spin_lock(&hugetlb_lock);
> -	if (page) {
> +	/*
> +	 * We could have raced with the pool size change.
> +	 * Double check that and simply deallocate the new page
> +	 * if we would end up overcommitting the surpluses. Abuse
> +	 * the temporary page to work around the nasty free_huge_page
> +	 * code flow.
> +	 */
> +	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> +		SetPageHugeTemporary(page);
> +		put_page(page);
> +		page = NULL;
> +	} else {
> +		h->surplus_huge_pages_node[page_to_nid(page)]++;
> +		h->surplus_huge_pages++;
>  		INIT_LIST_HEAD(&page->lru);
>  		r_nid = page_to_nid(page);
>  		set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
>  		set_hugetlb_cgroup(page, NULL);
> -		/*
> -		 * We incremented the global counters already
> -		 */
>  		h->nr_huge_pages_node[r_nid]++;
>  		h->surplus_huge_pages_node[r_nid]++;
> -	} else {
> -		h->nr_huge_pages--;
> -		h->surplus_huge_pages--;

In the case of a successful surplus allocation, the following counters
are incremented:

h->surplus_huge_pages_node[page_to_nid(page)]++;
h->surplus_huge_pages++;
h->nr_huge_pages_node[r_nid]++;
h->surplus_huge_pages_node[r_nid]++;

Looks like per-node surplus_huge_pages_node is incremented twice, and
global nr_huge_pages is not incremented at all.

Also, you removed r_nid, so I'm guessing this will not compile?
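
To illustrate, here is a sketch of what the success branch presumably
needs to end up looking like, reconstructed from the counters the
removed code maintained: r_nid reintroduced as a local, and each of the
global and per-node counters bumped exactly once. A suggestion only,
not necessarily the final fix:

	} else {
		/* sketch: re-declare r_nid locally, the patch dropped it */
		int r_nid = page_to_nid(page);

		INIT_LIST_HEAD(&page->lru);
		set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
		set_hugetlb_cgroup(page, NULL);
		/* bump global and per-node counts exactly once each */
		h->nr_huge_pages++;
		h->nr_huge_pages_node[r_nid]++;
		h->surplus_huge_pages++;
		h->surplus_huge_pages_node[r_nid]++;
	}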
-- 
Mike Kravetz


>  	}
> +
> +out_unlock:
>  	spin_unlock(&hugetlb_lock);
>  
>  	return page;
> 
