* hugetlb: reservation race leading to under provisioning
@ 2017-01-05 15:15 Michal Hocko
  2017-01-06  0:48 ` Mike Kravetz
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2017-01-05 15:15 UTC (permalink / raw)
  To: Naoya Horiguchi, Mike Kravetz; +Cc: linux-mm

Hi,
we have a customer report on an older kernel (3.12), but I believe the
same issue is present in the current vanilla kernel. There is a race in
which an mmap trying to establish a reservation fails when racing with
truncate_hugepages. See the attached reproducer.

It should go like this (the analysis comes from the customer; I hope I
haven't garbled their write-up).

: Task (T1) does mmap and calls into gather_surplus_pages(), looking for N
: pages.  It determines it needs to allocate N pages, drops the lock, and
: does so.
: 
: We will have:
: hstate->resv_huge_pages == N
: hstate->free_huge_pages == N
: 
: That mapping is then munmap()ed by task T2, which truncates the file:
: 
: truncate_hugepages() {
: 	for each page of the inode after lstart {
: 		truncate_huge_page(page) {
: 			hugetlb_unreserve_pages() {
: 				hugetlb_acct_memory() {
: 					return_unused_surplus_pages() {
: 
: return_unused_surplus_pages() drops h->resv_huge_pages to 0, then
: begins calling free_pool_huge_page() N times:
: 
: 	h->resv_huge_pages -= unused_resv_pages
: 	while (nr_pages--) {
: 		free_pool_huge_page(h, &node_states[N_MEMORY], 1) {
: 			h->free_huge_pages--;
: 		}
: 		cond_resched_lock(&hugetlb_lock);
: 	}
: 
: But the cond_resched_lock() triggers, and it releases the lock with
: 
: h->resv_huge_pages == 0
: h->free_huge_pages == M << N
: 
: T1 having completed its allocations with allocated == N now
: acquires the lock, and recomputes
: 
: needed = (h->resv_huge_pages + delta) - (h->free_huge_pages + allocated);
: 
: needed = N - (M + N) = -M
: 
: Then
: 
: needed += N                  = -M+N
: h->resv_huge_pages += N       = N
: 
: It frees N-M pages to the hugetlb pool via enqueue_huge_page(),
: 
: list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
: 	if ((--needed) < 0)
: 		break;
: 	/*
: 	 * This page is now managed by the hugetlb allocator and has
: 	 * no users -- drop the buddy allocator's reference.
: 	 */
: 	put_page_testzero(page);
: 	VM_BUG_ON(page_count(page));
: 	enqueue_huge_page(h, page) {
: 		h->free_huge_pages++;
: 	}
: }
: 
: h->resv_huge_pages == N
: h->free_huge_pages == N-M
: 
: It releases the lock in order to free the remainder of surplus_list
: via put_page().
: 
: When it releases the lock, T1 reclaims it and returns from
: gather_surplus_pages().
: 
: But then hugetlb_acct_memory() checks
: 
: 	if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
: 		return_unused_surplus_pages(h, delta);
: 		goto out;
: 	}
: 
: and returns -ENOMEM.

The cond_resched was added by 7848a4bf51b3 ("mm/hugetlb.c: add
cond_resched_lock() in return_unused_surplus_pages()") and it smells
fishy AFAICT: it leaves the hstate behind in an inconsistent state.
I guess we want to uncommit the reservation one page at a time, something like:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3edb759c5c7d..e3a599146d7c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1783,12 +1783,13 @@ static void return_unused_surplus_pages(struct hstate *h,
 {
 	unsigned long nr_pages;
 
-	/* Uncommit the reservation */
-	h->resv_huge_pages -= unused_resv_pages;
 
 	/* Cannot return gigantic pages currently */
-	if (hstate_is_gigantic(h))
+	if (hstate_is_gigantic(h)) {
+		/* Uncommit the reservation */
+		h->resv_huge_pages -= unused_resv_pages;
 		return;
+	}
 
 	nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
 
@@ -1803,6 +1804,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 	while (nr_pages--) {
 		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
 			break;
+		h->resv_huge_pages--;
 		cond_resched_lock(&hugetlb_lock);
 	}
 }

but I just don't get the nr_pages = min(...) part, or how we can end up
with fewer surplus_huge_pages than unused_resv_pages....  This whole
code is so confusing that I would rather go with a simple revert of
7848a4bf51b3, which would also be much easier to backport to stable.

What do you guys think?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* Re: hugetlb: reservation race leading to under provisioning
  2017-01-05 15:15 hugetlb: reservation race leading to under provisioning Michal Hocko
@ 2017-01-06  0:48 ` Mike Kravetz
  2017-01-06  8:58   ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Kravetz @ 2017-01-06  0:48 UTC (permalink / raw)
  To: Michal Hocko, Naoya Horiguchi; +Cc: linux-mm

On 01/05/2017 07:15 AM, Michal Hocko wrote:
> Hi,
> we have a customer report on an older kernel (3.12), but I believe the
> same issue is present in the current vanilla kernel. There is a race in
> which an mmap trying to establish a reservation fails when racing with
> truncate_hugepages. See the attached reproducer.
> 
> It should go like this (the analysis comes from the customer; I hope I
> haven't garbled their write-up).
> 
> : Task (T1) does mmap and calls into gather_surplus_pages(), looking for N
> : pages.  It determines it needs to allocate N pages, drops the lock, and
> : does so.
> : 
> : We will have:
> : hstate->resv_huge_pages == N
> : hstate->free_huge_pages == N
> : 
> : That mapping is then munmap()ed by task T2, which truncates the file:
> : 
> : truncate_hugepages() {
> : 	for each page of the inode after lstart {
> : 		truncate_huge_page(page) {
> : 			hugetlb_unreserve_pages() {
> : 				hugetlb_acct_memory() {
> : 					return_unused_surplus_pages() {
> : 
> : return_unused_surplus_pages() drops h->resv_huge_pages to 0, then
> : begins calling free_pool_huge_page() N times:
> : 
> : 	h->resv_huge_pages -= unused_resv_pages
> : 	while (nr_pages--) {
> : 		free_pool_huge_page(h, &node_states[N_MEMORY], 1) {
> : 			h->free_huge_pages--;
> : 		}
> : 		cond_resched_lock(&hugetlb_lock);
> : 	}
> : 
> : But the cond_resched_lock() triggers, and it releases the lock with
> : 
> : h->resv_huge_pages == 0
> : h->free_huge_pages == M << N
> : 
> : T1 having completed its allocations with allocated == N now
> : acquires the lock, and recomputes
> : 
> : needed = (h->resv_huge_pages + delta) - (h->free_huge_pages + allocated);
> : 
> : needed = N - (M + N) = -M
> : 
> : Then
> : 
> : needed += N                  = -M+N
> : h->resv_huge_pages += N       = N
> : 
> : It frees N-M pages to the hugetlb pool via enqueue_huge_page(),
> : 
> : list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> : 	if ((--needed) < 0)
> : 		break;
> : 		/*
> : 		* This page is now managed by the hugetlb allocator and has
> : 		* no users -- drop the buddy allocator's reference.
> : 		*/
> : 		put_page_testzero(page);
> : 		VM_BUG_ON(page_count(page));
> : 		enqueue_huge_page(h, page) {
> : 			h->free_huge_pages++;
> : 		}
> : 	}
> : 
> : h->resv_huge_pages == N
> : h->free_huge_pages == N-M

Are you sure about free_huge_page?

When we entered the routine
h->free_huge_pages == M << N

After the above loop, I think
h->free_huge_pages == M + (N-M)

> : 
> : It releases the lock in order to free the remainder of surplus_list
> : via put_page().
> : 
> : When it releases the lock, T1 reclaims it and returns from
> : gather_surplus_pages().
> : 
> : But then hugetlb_acct_memory() checks
> : 
> : 	if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
> : 		return_unused_surplus_pages(h, delta);
> : 		goto out;
> : 	}
> : 
> : and returns -ENOMEM.

I'm wondering if this may have more to do with numa allocations of
surplus pages.  Do you know if customer uses any memory policy for
allocations?  There was a change after 3.12 for this (commit 099730d67417).

> 
> The cond_resched was added by 7848a4bf51b3 ("mm/hugetlb.c: add
> cond_resched_lock() in return_unused_surplus_pages()") and it smells
> fishy AFAICT: it leaves the hstate behind in an inconsistent state.
> I guess we want to uncommit the reservation one page at a time, something like:
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3edb759c5c7d..e3a599146d7c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1783,12 +1783,13 @@ static void return_unused_surplus_pages(struct hstate *h,
>  {
>  	unsigned long nr_pages;
>  
> -	/* Uncommit the reservation */
> -	h->resv_huge_pages -= unused_resv_pages;
>  
>  	/* Cannot return gigantic pages currently */
> -	if (hstate_is_gigantic(h))
> +	if (hstate_is_gigantic(h)) {
> +		/* Uncommit the reservation */
> +		h->resv_huge_pages -= unused_resv_pages;
>  		return;
> +	}
>  
>  	nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
>  
> @@ -1803,6 +1804,7 @@ static void return_unused_surplus_pages(struct hstate *h,
>  	while (nr_pages--) {
>  		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
>  			break;
> +		h->resv_huge_pages--;
>  		cond_resched_lock(&hugetlb_lock);
>  	}
>  }
> 
> but I just don't get the nr_pages = min(...) part, or how we can end up
> with fewer surplus_huge_pages than unused_resv_pages.... 

Think about the case where there are pre-allocated huge pages in the mix.
Suppose you want to reserve 5 pages via mmap.  There are 3 pre-allocated
free pages which can be used for the reservation.  However, 2 additional
surplus pages will need to be allocated to cover all the reservations.

In this case, I believe the code above would have:
unused_resv_pages = 5
h->surplus_huge_pages = 2
So, the loop would only decrement resv_huge_pages by 2 and leak 3 pages.
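
Mike's arithmetic can be replayed with a quick counter model of Michal's
patched loop (a Python sketch standing in for the kernel C; the function
name and the page counts are mine, taken from the hypothetical scenario
above):

```python
def patched_return_unused(resv, surplus, free, unused_resv_pages):
    # Model of Michal's patched return_unused_surplus_pages(): the
    # reservation is uncommitted one page per loop iteration, but the
    # loop runs at most min(unused_resv_pages, surplus) times.
    nr_pages = min(unused_resv_pages, surplus)
    while nr_pages:
        nr_pages -= 1
        free -= 1       # free_pool_huge_page() releases a surplus page
        surplus -= 1
        resv -= 1       # the per-iteration uncommit added by the patch
    return resv, surplus, free

# 5 reservations: 3 backed by pre-allocated pool pages, 2 by surplus.
resv, surplus, free = patched_return_unused(
    resv=5, surplus=2, free=5, unused_resv_pages=5)
print(resv)  # 3 -- three reservations are never uncommitted (leaked)
```

The loop bound, not the uncommit itself, is what leaks: any unused
reservations above surplus_huge_pages never reach the decrement.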

>                                                                     This
> whole code is so confusing

Yes, I wrote about 5 replies to this e-mail and deleted them before
hitting send as I later realized they were incorrect.  I'm going to
add 'hugetlb reservations' to your proposed LSF/MM topic of areas
in need of attention.

> whole code is so confusing that I would even rather go with a simple
> revert of 7848a4bf51b3 which would be much easier for the stable backport.
> 
> What do you guys think?

Let me think about it some more.  At first, I thought it certainly was
a bad idea to drop the lock in return_unused_surplus_pages.  But, the
more I think about it, the more I think it is OK.  There should not be
a problem with dropping the reserve count all at once.  The reserve map
which corresponds to the global reserve count has already been cleared.

-- 
Mike Kravetz


* Re: hugetlb: reservation race leading to under provisioning
  2017-01-06  0:48 ` Mike Kravetz
@ 2017-01-06  8:58   ` Michal Hocko
  2017-01-06 15:01     ` Michal Hocko
  2017-01-06 21:57     ` Paul Cassella
  0 siblings, 2 replies; 8+ messages in thread
From: Michal Hocko @ 2017-01-06  8:58 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Naoya Horiguchi, linux-mm, Paul Cassella

Let me add Paul who has done the analysis. I just slightly reworded his
report - hopefully not screwing up anything.

Keeping the full quote for reference.

On Thu 05-01-17 16:48:03, Mike Kravetz wrote:
> On 01/05/2017 07:15 AM, Michal Hocko wrote:
> > Hi,
> > we have a customer report on an older kernel (3.12), but I believe the
> > same issue is present in the current vanilla kernel. There is a race in
> > which an mmap trying to establish a reservation fails when racing with
> > truncate_hugepages. See the attached reproducer.
> > 
> > It should go like this (the analysis comes from the customer; I hope I
> > haven't garbled their write-up).
> > 
> > : Task (T1) does mmap and calls into gather_surplus_pages(), looking for N
> > : pages.  It determines it needs to allocate N pages, drops the lock, and
> > : does so.
> > : 
> > : We will have:
> > : hstate->resv_huge_pages == N
> > : hstate->free_huge_pages == N
> > : 
> > : That mapping is then munmap()ed by task T2, which truncates the file:
> > : 
> > : truncate_hugepages() {
> > : 	for each page of the inode after lstart {
> > : 		truncate_huge_page(page) {
> > : 			hugetlb_unreserve_pages() {
> > : 				hugetlb_acct_memory() {
> > : 					return_unused_surplus_pages() {
> > : 
> > : return_unused_surplus_pages() drops h->resv_huge_pages to 0, then
> > : begins calling free_pool_huge_page() N times:
> > : 
> > : 	h->resv_huge_pages -= unused_resv_pages
> > : 	while (nr_pages--) {
> > : 		free_pool_huge_page(h, &node_states[N_MEMORY], 1) {
> > : 			h->free_huge_pages--;
> > : 		}
> > : 		cond_resched_lock(&hugetlb_lock);
> > : 	}
> > : 
> > : But the cond_resched_lock() triggers, and it releases the lock with
> > : 
> > : h->resv_huge_pages == 0
> > : h->free_huge_pages == M << N
> > : 
> > : T1 having completed its allocations with allocated == N now
> > : acquires the lock, and recomputes
> > : 
> > : needed = (h->resv_huge_pages + delta) - (h->free_huge_pages + allocated);
> > : 
> > : needed = N - (M + N) = -M
> > : 
> > : Then
> > : 
> > : needed += N                  = -M+N
> > : h->resv_huge_pages += N       = N
> > : 
> > : It frees N-M pages to the hugetlb pool via enqueue_huge_page(),
> > : 
> > : list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> > : 	if ((--needed) < 0)
> > : 		break;
> > : 		/*
> > : 		* This page is now managed by the hugetlb allocator and has
> > : 		* no users -- drop the buddy allocator's reference.
> > : 		*/
> > : 		put_page_testzero(page);
> > : 		VM_BUG_ON(page_count(page));
> > : 		enqueue_huge_page(h, page) {
> > : 			h->free_huge_pages++;
> > : 		}
> > : 	}
> > : 
> > : h->resv_huge_pages == N
> > : h->free_huge_pages == N-M
> 
> Are you sure about free_huge_page?
> 
> When we entered the routine
> h->free_huge_pages == M << N
> 
> After the above loop, I think
> h->free_huge_pages == M + (N-M)
> 
> > : 
> > : It releases the lock in order to free the remainder of surplus_list
> > : via put_page().
> > : 
> > : When it releases the lock, T1 reclaims it and returns from
> > : gather_surplus_pages().
> > : 
> > : But then hugetlb_acct_memory() checks
> > : 
> > : 	if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
> > : 		return_unused_surplus_pages(h, delta);
> > : 		goto out;
> > : 	}
> > : 
> > : and returns -ENOMEM.
> 
> I'm wondering if this may have more to do with numa allocations of
> surplus pages.  Do you know if customer uses any memory policy for
> allocations?  There was a change after 3.12 for this (commit 099730d67417).
> 
> > 
> > The cond_resched was added by 7848a4bf51b3 ("mm/hugetlb.c: add
> > cond_resched_lock() in return_unused_surplus_pages()") and it smells
> > fishy AFAICT: it leaves the hstate behind in an inconsistent state.
> > I guess we want to uncommit the reservation one page at a time, something like:
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3edb759c5c7d..e3a599146d7c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1783,12 +1783,13 @@ static void return_unused_surplus_pages(struct hstate *h,
> >  {
> >  	unsigned long nr_pages;
> >  
> > -	/* Uncommit the reservation */
> > -	h->resv_huge_pages -= unused_resv_pages;
> >  
> >  	/* Cannot return gigantic pages currently */
> > -	if (hstate_is_gigantic(h))
> > +	if (hstate_is_gigantic(h)) {
> > +		/* Uncommit the reservation */
> > +		h->resv_huge_pages -= unused_resv_pages;
> >  		return;
> > +	}
> >  
> >  	nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
> >  
> > @@ -1803,6 +1804,7 @@ static void return_unused_surplus_pages(struct hstate *h,
> >  	while (nr_pages--) {
> >  		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
> >  			break;
> > +		h->resv_huge_pages--;
> >  		cond_resched_lock(&hugetlb_lock);
> >  	}
> >  }
> > 
> > but I just don't get the nr_pages = min(...) part, or how we can end up
> > with fewer surplus_huge_pages than unused_resv_pages.... 
> 
> Think about the case where there are pre-allocated huge pages in the mix.
> Suppose you want to reserve 5 pages via mmap.  There are 3 pre-allocated
> free pages which can be used for the reservation.  However, 2 additional
> surplus pages will need to be allocated to cover all the reservations.
> 
> In this case, I believe the code above would have:
> unused_resv_pages = 5
> h->surplus_huge_pages = 2
> So, the loop would only decrement resv_huge_pages by 2 and leak 3 pages.
> 
> >                                                                     This
> > whole code is so confusing
> 
> Yes, I wrote about 5 replies to this e-mail and deleted them before
> hitting send as I later realized they were incorrect.  I'm going to
> add 'hugetlb reservations' to your proposed LSF/MM topic of areas
> in need of attention.
> 
> > whole code is so confusing that I would even rather go with a simple
> > revert of 7848a4bf51b3 which would be much easier for the stable backport.
> > 
> > What do you guys think?
> 
> Let me think about it some more.  At first, I thought it certainly was
> a bad idea to drop the lock in return_unused_surplus_pages.  But, the
> more I think about it, the more I think it is OK.  There should not be
> a problem with dropping the reserve count all at once.  The reserve map
> which corresponds to the global reserve count has already been cleared.
> 
> -- 
> Mike Kravetz

-- 
Michal Hocko
SUSE Labs


* Re: hugetlb: reservation race leading to under provisioning
  2017-01-06  8:58   ` Michal Hocko
@ 2017-01-06 15:01     ` Michal Hocko
  2017-01-06 21:57     ` Paul Cassella
  1 sibling, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2017-01-06 15:01 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Naoya Horiguchi, linux-mm, Paul Cassella

I have only now realized that I haven't attached the promised program to
replicate the issue.

$ cat badmmap.c
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <hugetlbfs.h>
#include <sys/mman.h>
int main(int argc, const char **argv) {
  const size_t len = 224 * 1024 * 1024;
  int count = (argc > 1) ? atoi(argv[1]) : 1000;
  int i;
  for (i = 0; i < count; ++i) {
    int fd = hugetlbfs_unlinked_fd();
    void *ptr = mmap(0, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
      fprintf(stderr, "mmap() failed on iteration %d (%s)\n", i, strerror(errno));
      return 1;
    }
    close(fd);
    if (munmap(ptr, len) < 0) {
      fprintf(stderr, "munmap() failed on iteration %d (%s)\n", i, strerror(errno));
      return 1;
    }
  }
  printf("PASSED %d iters\n", count);
  return 0;
}

$ cc -o badmmap badmmap.c -lhugetlbfs

Run 8 or so instances in parallel to reproduce.
-- 
Michal Hocko
SUSE Labs


* Re: hugetlb: reservation race leading to under provisioning
  2017-01-06  8:58   ` Michal Hocko
  2017-01-06 15:01     ` Michal Hocko
@ 2017-01-06 21:57     ` Paul Cassella
  2017-01-08 19:08       ` Mike Kravetz
  2017-01-09  8:58       ` Michal Hocko
  1 sibling, 2 replies; 8+ messages in thread
From: Paul Cassella @ 2017-01-06 21:57 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Mike Kravetz, Naoya Horiguchi, linux-mm

On Fri, 6 Jan 2017, Michal Hocko wrote:
> On Thu 05-01-17 16:48:03, Mike Kravetz wrote:
> > On 01/05/2017 07:15 AM, Michal Hocko wrote:

> > > we have a customer report on an older kernel (3.12), but I believe the
> > > same issue is present in the current vanilla kernel. There is a race in
> > > which an mmap trying to establish a reservation fails when racing with
> > > truncate_hugepages. See the attached reproducer.
> > > 
> > > It should go like this (the analysis comes from the customer; I hope I
> > > haven't garbled their write-up).

Hi Michal,

There may have been a step missing from what was sent to you, right at the 
point Mike asked about.  I've added it below.


> > > : Task (T1) does mmap and calls into gather_surplus_pages(), looking for N
> > > : pages.  It determines it needs to allocate N pages, drops the lock, and
> > > : does so.
> > > : 
> > > : We will have:
> > > : hstate->resv_huge_pages == N
> > > : hstate->free_huge_pages == N

Note that those N pages are not T1's.  (The test case involves several 
tasks creating files of the same size.)  Those N pages belong to a 
different file that T2 is about to munmap:

> > > : That mapping is then munmap()ed by task T2, which truncates the file:
> > > : 
> > > : truncate_hugepages() {
> > > : 	for each page of the inode after lstart {
> > > : 		truncate_huge_page(page) {
> > > : 			hugetlb_unreserve_pages() {
> > > : 				hugetlb_acct_memory() {
> > > : 					return_unused_surplus_pages() {
> > > : 
> > > : return_unused_surplus_pages() drops h->resv_huge_pages to 0, then
> > > : begins calling free_pool_huge_page() N times:
> > > : 
> > > : 	h->resv_huge_pages -= unused_resv_pages
> > > : 	while (nr_pages--) {
> > > : 		free_pool_huge_page(h, &node_states[N_MEMORY], 1) {
> > > : 			h->free_huge_pages--;
> > > : 		}
> > > : 		cond_resched_lock(&hugetlb_lock);
> > > : 	}
> > > : 
> > > : But the cond_resched_lock() triggers, and it releases the lock with
> > > : 
> > > : h->resv_huge_pages == 0
> > > : h->free_huge_pages == M << N

T2 at this point has freed N-M pages.


> > > : T1 having completed its allocations with allocated == N now
> > > : acquires the lock, and recomputes
> > > : 
> > > : needed = (h->resv_huge_pages + delta) - (h->free_huge_pages + allocated);
> > > : 
> > > : needed = N - (M + N) = -M
> > > : 
> > > : Then
> > > : 
> > > : needed += N                  = -M+N
> > > : h->resv_huge_pages += N       = N
> > > : 
> > > : It frees N-M pages to the hugetlb pool via enqueue_huge_page(),
> > > : 
> > > : list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> > > : 	if ((--needed) < 0)
> > > : 		break;
> > > : 		/*
> > > : 		* This page is now managed by the hugetlb allocator and has
> > > : 		* no users -- drop the buddy allocator's reference.
> > > : 		*/
> > > : 		put_page_testzero(page);
> > > : 		VM_BUG_ON(page_count(page));
> > > : 		enqueue_huge_page(h, page) {
> > > : 			h->free_huge_pages++;
> > > : 		}
> > > : 	}


> > Are you sure about free_huge_page?

Hi Mike,

There was a step missing here.  You're right that at this point

    h->resv_huge_pages == N
> > h->free_huge_pages == M + (N-M)
                                    == N

Continuing with T1 releasing the lock:

> > > : It releases the lock in order to free the remainder of surplus_list
> > > : via put_page().

When T1 releases the lock, T2 reacquires it and continues its loop in
return_unused_surplus_pages().  It calls free_pool_huge_page() M
more times to go with the N-M it had already done.  Then T2 releases
the lock with

h->resv_huge_pages == N
h->free_huge_pages == N - M

> > > : When it releases the lock, T1 reclaims it and returns from
> > > : gather_surplus_pages().
> > > : 
> > > : But then hugetlb_acct_memory() checks
> > > : 
> > > : 	if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
> > > : 		return_unused_surplus_pages(h, delta);
> > > : 		goto out;
> > > : 	}
> > > : 
> > > : and returns -ENOMEM.
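
With the missing step restored, the whole interleaving can be replayed
with plain counters (a Python sketch of the C logic; N = 4 and M = 1 are
hypothetical values, and any 0 < M < N shows the same effect):

```python
N, M = 4, 1                   # hypothetical page counts
h = {"resv": N, "free": N}    # state when T1 drops the lock to allocate

# T2: return_unused_surplus_pages() uncommits everything up front, then
# frees N - M pages before cond_resched_lock() lets go of the lock.
h["resv"] -= N
h["free"] -= N - M
assert (h["resv"], h["free"]) == (0, M)

# T1: gather_surplus_pages() retakes the lock with allocated == N
# and recomputes how many of its pages the pool still needs.
delta, allocated = N, N
needed = (h["resv"] + delta) - (h["free"] + allocated)   # == -M
needed += allocated                                      # == N - M
h["resv"] += allocated                                   # back to N
h["free"] += needed        # enqueues N - M pages: M + (N - M) == N

# T2 resumes its loop and frees the remaining M pages.
h["free"] -= M

# Final state: free_huge_pages < resv_huge_pages, so the check
# in hugetlb_acct_memory() (delta > free) fails with -ENOMEM.
assert h["resv"] == N and h["free"] == N - M
assert delta > h["free"]
```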



> > I'm wondering if this may have more to do with numa allocations of
> > surplus pages.  Do you know if customer uses any memory policy for
> > allocations?  There was a change after 3.12 for this (commit 099730d67417).

FWIW, we do have that commit applied for reasons unrelated to this bug.

I had been wondering about the numa aspect, but the test case reproduces 
the problem on a non-numa system with a more recent vanilla kernel.


-- 
Paul Cassella


* Re: hugetlb: reservation race leading to under provisioning
  2017-01-06 21:57     ` Paul Cassella
@ 2017-01-08 19:08       ` Mike Kravetz
  2017-01-09 10:25         ` Michal Hocko
  2017-01-09  8:58       ` Michal Hocko
  1 sibling, 1 reply; 8+ messages in thread
From: Mike Kravetz @ 2017-01-08 19:08 UTC (permalink / raw)
  To: Paul Cassella, Michal Hocko; +Cc: Naoya Horiguchi, linux-mm

On 01/06/2017 01:57 PM, Paul Cassella wrote:
> On Fri, 6 Jan 2017, Michal Hocko wrote:
>> On Thu 05-01-17 16:48:03, Mike Kravetz wrote:
>>> On 01/05/2017 07:15 AM, Michal Hocko wrote:
> 
>>>> we have a customer report on an older kernel (3.12), but I believe the
>>>> same issue is present in the current vanilla kernel. There is a race in
>>>> which an mmap trying to establish a reservation fails when racing with
>>>> truncate_hugepages. See the attached reproducer.
>>>>
>>>> It should go like this (the analysis comes from the customer; I hope I
>>>> haven't garbled their write-up).
> 
> Hi Michal,
> 
> There may have been a step missing from what was sent to you, right at the 
> point Mike asked about.  I've added it below.
> 
> 
>>>> : Task (T1) does mmap and calls into gather_surplus_pages(), looking for N
>>>> : pages.  It determines it needs to allocate N pages, drops the lock, and
>>>> : does so.
>>>> : 
>>>> : We will have:
>>>> : hstate->resv_huge_pages == N
>>>> : hstate->free_huge_pages == N
> 
> Note that those N pages are not T1's.  (The test case involves several 
> tasks creating files of the same size.)  Those N pages belong to a 
> different file that T2 is about to munmap:
> 
>>>> : That mapping is then munmap()ed by task T2, which truncates the file:
>>>> : 
>>>> : truncate_hugepages() {
>>>> : 	for each page of the inode after lstart {
>>>> : 		truncate_huge_page(page) {
>>>> : 			hugetlb_unreserve_pages() {
>>>> : 				hugetlb_acct_memory() {
>>>> : 					return_unused_surplus_pages() {
>>>> : 
>>>> : return_unused_surplus_pages() drops h->resv_huge_pages to 0, then
>>>> : begins calling free_pool_huge_page() N times:
>>>> : 
>>>> : 	h->resv_huge_pages -= unused_resv_pages
>>>> : 	while (nr_pages--) {
>>>> : 		free_pool_huge_page(h, &node_states[N_MEMORY], 1) {
>>>> : 			h->free_huge_pages--;
>>>> : 		}
>>>> : 		cond_resched_lock(&hugetlb_lock);
>>>> : 	}
>>>> : 
>>>> : But the cond_resched_lock() triggers, and it releases the lock with
>>>> : 
>>>> : h->resv_huge_pages == 0
>>>> : h->free_huge_pages == M << N
> 
> T2 at this point has freed N-M pages.
> 
> 
>>>> : T1 having completed its allocations with allocated == N now
>>>> : acquires the lock, and recomputes
>>>> : 
>>>> : needed = (h->resv_huge_pages + delta) - (h->free_huge_pages + allocated);
>>>> : 
>>>> : needed = N - (M + N) = -M
>>>> : 
>>>> : Then
>>>> : 
>>>> : needed += N                  = -M+N
>>>> : h->resv_huge_pages += N       = N
>>>> : 
>>>> : It frees N-M pages to the hugetlb pool via enqueue_huge_page(),
>>>> : 
>>>> : list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
>>>> : 	if ((--needed) < 0)
>>>> : 		break;
>>>> : 		/*
>>>> : 		* This page is now managed by the hugetlb allocator and has
>>>> : 		* no users -- drop the buddy allocator's reference.
>>>> : 		*/
>>>> : 		put_page_testzero(page);
>>>> : 		VM_BUG_ON(page_count(page));
>>>> : 		enqueue_huge_page(h, page) {
>>>> : 			h->free_huge_pages++;
>>>> : 		}
>>>> : 	}
> 
> 
>>> Are you sure about free_huge_page?
> 
> Hi Mike,
> 
> There was a step missing here.  You're right that at this point
> 
>     h->resv_huge_pages == N
>>> h->free_huge_pages == M + (N-M)
>                                     == N
> 
> Continuing with T1 releasing the lock:
> 
>>>> : It releases the lock in order to free the remainder of surplus_list
>>>> : via put_page().
> 
> When T1 releases the lock, T2 reacquires it and continues its loop in
> return_unused_surplus_pages().  It calls free_pool_huge_page() M
> more times to go with the N-M it had already done.  Then T2 releases
> the lock with
> 
> h->resv_huge_pages == N
> h->free_huge_pages == N - M
> 
>>>> : When it releases the lock, T1 reclaims it and returns from
>>>> : gather_surplus_pages().
>>>> : 
>>>> : But then hugetlb_acct_memory() checks
>>>> : 
>>>> : 	if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
>>>> : 		return_unused_surplus_pages(h, delta);
>>>> : 		goto out;
>>>> : 	}
>>>> : 
>>>> : and returns -ENOMEM.
> 
> 
> 
>>> I'm wondering if this may have more to do with numa allocations of
>>> surplus pages.  Do you know if customer uses any memory policy for
>>> allocations?  There was a change after 3.12 for this (commit 099730d67417).
> 
> FWIW, we do have that commit applied for reasons unrelated to this bug.
> 
> I had been wondering about the numa aspect, but the test case reproduces 
> the problem on a non-numa system with a more recent vanilla kernel.

Thanks for the additional information Paul.

I had to think about it a lot, but agree that the problem is the
cond_resched_lock added to return_unused_surplus_pages in commit
7848a4bf51b3.  As noted, the
routine first decrements resv_huge_pages by unused_resv_pages and then
frees surplus pages one by one while potentially dropping the hugetlb_lock
between calls to free pages.  Also note that these surplus pages are
counted as free pages, so they can be used by anyone else.  The only thing
that prevents them from being taken by others is the reservation count
(resv_huge_pages).

If the lock is dropped before freeing all the surplus pages, another task
(such as T1 in this case) can use the pages for its own purposes.  In the
case of T1, it uses the pages to back the reservation associated with a
mmap call.  Note that one of the rules/assumptions is that the count
free_huge_pages must be greater than or equal to resv_huge_pages.

There is technically no problem with another task (such as T1) using the
free surplus pages.  The problem happens when return_unused_surplus_pages
reacquires the lock and continues to free surplus huge pages.  It does not
know (or take into account) that another task may have claimed the pages.

There are a couple of ways to address this issue.  As Michal points out, we
could just revert 7848a4bf51b3.  But, that was added for a reason so we
would reintroduce the soft lockup issue it was trying to fix.  Another
approach is to have return_unused_surplus_pages decrement the resv_huge_pages
as it frees the surplus pages.  Michal's patch to do this did not always
decrement resv_huge_pages as it should.  A fixed up version of Michal's patch
would be something like the following (untested):

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3edb759..221abdc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1783,12 +1783,9 @@ static void return_unused_surplus_pages(struct hstate *h,
 {
 	unsigned long nr_pages;
 
-	/* Uncommit the reservation */
-	h->resv_huge_pages -= unused_resv_pages;
-
 	/* Cannot return gigantic pages currently */
 	if (hstate_is_gigantic(h))
-		return;
+		goto out;
 
 	nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
 
@@ -1801,10 +1798,16 @@ static void return_unused_surplus_pages(struct hstate *h,
 	 * on-line nodes with memory and will handle the hstate accounting.
 	 */
 	while (nr_pages--) {
+		h->resv_huge_pages--;
+		unused_resv_pages--;
 		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
-			break;
+			goto out;
 		cond_resched_lock(&hugetlb_lock);
 	}
+
+out:
+	/* Fully uncommit the reservation */
+	h->resv_huge_pages -= unused_resv_pages;
 }

-- 
Mike Kravetz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: hugetlb: reservation race leading to under provisioning
  2017-01-06 21:57     ` Paul Cassella
  2017-01-08 19:08       ` Mike Kravetz
@ 2017-01-09  8:58       ` Michal Hocko
  1 sibling, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2017-01-09  8:58 UTC (permalink / raw)
  To: Paul Cassella; +Cc: Mike Kravetz, Naoya Horiguchi, linux-mm

On Fri 06-01-17 13:57:59, Paul Cassella wrote:
> On Fri, 6 Jan 2017, Michal Hocko wrote:
> > On Thu 05-01-17 16:48:03, Mike Kravetz wrote:
> > > On 01/05/2017 07:15 AM, Michal Hocko wrote:
> 
> > > > we have a customer report on an older kernel (3.12) but I believe the
> > > > same issue is present in the current vanilla kernel. There is a race
> > > > between mmap trying to do a reservation which fails when racing with
> > > > truncate_hugepages. See the reproduced attached.
> > > > 
> > > > It should go like this (analysis come from the customer and I hope I
> > > > haven't screwed their write up).
> 
> Hi Michal,
> 
> There may have been a step missing from what was sent to you, right at the 
> point Mike asked about.  I've added it below.

No, I guess the info I got from you was complete.  I was trying to reduce
the info to the minimum and wasn't explicit enough about this fact,
assuming it would be clear from the test case, which I forgot to attach...
Sorry about that!
-- 
Michal Hocko
SUSE Labs


* Re: hugetlb: reservation race leading to under provisioning
  2017-01-08 19:08       ` Mike Kravetz
@ 2017-01-09 10:25         ` Michal Hocko
  0 siblings, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2017-01-09 10:25 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Paul Cassella, Naoya Horiguchi, linux-mm

On Sun 08-01-17 11:08:30, Mike Kravetz wrote:
[...]
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3edb759..221abdc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1783,12 +1783,9 @@ static void return_unused_surplus_pages(struct hstate *h,
>  {
>  	unsigned long nr_pages;
>  
> -	/* Uncommit the reservation */
> -	h->resv_huge_pages -= unused_resv_pages;
> -
>  	/* Cannot return gigantic pages currently */
>  	if (hstate_is_gigantic(h))
> -		return;
> +		goto out;
>  
>  	nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
>  
> @@ -1801,10 +1798,16 @@ static void return_unused_surplus_pages(struct hstate *h,
>  	 * on-line nodes with memory and will handle the hstate accounting.
>  	 */
>  	while (nr_pages--) {
> +		h->resv_huge_pages--;
> +		unused_resv_pages--;
>  		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
> -			break;
> +			goto out;
>  		cond_resched_lock(&hugetlb_lock);
>  	}
> +
> +out:
> +	/* Fully uncommit the reservation */
> +	h->resv_huge_pages -= unused_resv_pages;
>  }

OK, this would handle the case I was wondering about.  It really deserves
a big fat comment explaining when this can happen.

Other than that, this looks OK and safe enough for stable after a second
look into the code.

Thanks!
-- 
Michal Hocko
SUSE Labs


end of thread, other threads:[~2017-01-09 10:25 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-05 15:15 hugetlb: reservation race leading to under provisioning Michal Hocko
2017-01-06  0:48 ` Mike Kravetz
2017-01-06  8:58   ` Michal Hocko
2017-01-06 15:01     ` Michal Hocko
2017-01-06 21:57     ` Paul Cassella
2017-01-08 19:08       ` Mike Kravetz
2017-01-09 10:25         ` Michal Hocko
2017-01-09  8:58       ` Michal Hocko
