[PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus

* [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
@ 2021-04-21  6:02 Muchun Song
  2021-04-21  8:03 ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Muchun Song @ 2021-04-21  6:02 UTC (permalink / raw)
  To: mike.kravetz, akpm, mhocko, osalvador; +Cc: linux-mm, linux-kernel, Muchun Song

The possible bad scenario:

CPU0:                           CPU1:

                                gather_surplus_pages()
                                  page = alloc_surplus_huge_page()
memory_failure_hugetlb()
  get_hwpoison_page(page)
    __get_hwpoison_page(page)
      get_page_unless_zero(page)
                                  zero = put_page_testzero(page)
                                  VM_BUG_ON_PAGE(!zero, page)
                                  enqueue_huge_page(h, page)
  put_page(page)

The memory-failure or soft_offline handlers can raise the refcount between
alloc_surplus_huge_page() and put_page_testzero(). In that case we can
trigger VM_BUG_ON_PAGE and wrongly add a page that still holds a reference
to the hugetlb pool list.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3476aa06da70..6c96332db34b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2145,17 +2145,14 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 
 	/* Free the needed pages to the hugetlb pool */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
-		int zeroed;
-
 		if ((--needed) < 0)
 			break;
 		/*
-		 * This page is now managed by the hugetlb allocator and has
-		 * no users -- drop the buddy allocator's reference.
+		 * The refcount can possibly be increased by memory-failure or
+		 * soft_offline handlers.
 		 */
-		zeroed = put_page_testzero(page);
-		VM_BUG_ON_PAGE(!zeroed, page);
-		enqueue_huge_page(h, page);
+		if (likely(put_page_testzero(page)))
+			enqueue_huge_page(h, page);
 	}
 free:
 	spin_unlock_irq(&hugetlb_lock);
-- 
2.11.0
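
A minimal userspace sketch of the interleaving described in the commit
message, replayed on a single thread. Nothing here is kernel code:
struct fake_page, fake_get_page_unless_zero(), fake_put_page_testzero() and
replay_race() are hypothetical stand-ins that model only the refcount.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the refcount is modelled. */
struct fake_page {
	atomic_int refcount;
};

/* Models get_page_unless_zero(): take a reference unless the count is 0. */
bool fake_get_page_unless_zero(struct fake_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Models put_page_testzero(): drop one reference, report whether it hit 0. */
bool fake_put_page_testzero(struct fake_page *p)
{
	return atomic_fetch_sub(&p->refcount, 1) == 1;
}

/*
 * Replays the CPU0/CPU1 ordering from the commit message.
 * Returns true if the loop in gather_surplus_pages() would see a zero
 * refcount and could enqueue the page.
 */
bool replay_race(bool hwpoison_takes_ref)
{
	struct fake_page page;

	atomic_init(&page.refcount, 1);            /* CPU1: fresh surplus page */
	if (hwpoison_takes_ref)
		fake_get_page_unless_zero(&page);  /* CPU0: transient reference */

	/* CPU1: drop the allocator's reference. CPU0's later put_page()
	 * is not modelled. */
	return fake_put_page_testzero(&page);
}
```

With hwpoison_takes_ref set, replay_race() returns false. That is the case
where the old code hit VM_BUG_ON_PAGE; the patched loop skips
enqueue_huge_page() instead.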



* Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-21  6:02 [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages Muchun Song
@ 2021-04-21  8:03 ` Michal Hocko
  2021-04-21  8:15   ` [External] " Muchun Song
  2021-04-21  8:33   ` HORIGUCHI NAOYA(堀口　直也)
  0 siblings, 2 replies; 20+ messages in thread
From: Michal Hocko @ 2021-04-21  8:03 UTC (permalink / raw)
  To: Muchun Song
  Cc: mike.kravetz, akpm, osalvador, linux-mm, linux-kernel, Naoya Horiguchi

[Cc Naoya]

On Wed 21-04-21 14:02:59, Muchun Song wrote:
> The possible bad scenario:
> 
> CPU0:                           CPU1:
> 
>                                 gather_surplus_pages()
>                                   page = alloc_surplus_huge_page()
> memory_failure_hugetlb()
>   get_hwpoison_page(page)
>     __get_hwpoison_page(page)
>       get_page_unless_zero(page)
>                                   zero = put_page_testzero(page)
>                                   VM_BUG_ON_PAGE(!zero, page)
>                                   enqueue_huge_page(h, page)
>   put_page(page)
> 
> The refcount can possibly be increased by memory-failure or soft_offline
> handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> hugetlb pool list.

The hwpoison side of this looks really suspicious to me. It shouldn't
really touch the reference count of hugetlb pages without being very
careful (and having hugetlb_lock held). What would happen if the
reference count was increased after the page has been enqueued into the
pool? This can just blow up later.

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3476aa06da70..6c96332db34b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2145,17 +2145,14 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  
>  	/* Free the needed pages to the hugetlb pool */
>  	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> -		int zeroed;
> -
>  		if ((--needed) < 0)
>  			break;
>  		/*
> -		 * This page is now managed by the hugetlb allocator and has
> -		 * no users -- drop the buddy allocator's reference.
> +		 * The refcount can possibly be increased by memory-failure or
> +		 * soft_offline handlers.
>  		 */
> -		zeroed = put_page_testzero(page);
> -		VM_BUG_ON_PAGE(!zeroed, page);
> -		enqueue_huge_page(h, page);
> +		if (likely(put_page_testzero(page)))
> +			enqueue_huge_page(h, page);
>  	}
>  free:
>  	spin_unlock_irq(&hugetlb_lock);
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs


* Re: [External] Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-21  8:03 ` Michal Hocko
@ 2021-04-21  8:15   ` Muchun Song
  2021-04-21  8:21     ` Oscar Salvador
  2021-04-21  8:25     ` Michal Hocko
  2021-04-21  8:33   ` HORIGUCHI NAOYA(堀口　直也)
  1 sibling, 2 replies; 20+ messages in thread
From: Muchun Song @ 2021-04-21  8:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, Oscar Salvador,
	Linux Memory Management List, LKML, Naoya Horiguchi

On Wed, Apr 21, 2021 at 4:03 PM Michal Hocko <mhocko@suse.com> wrote:
>
> [Cc Naoya]
>
> On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > The possible bad scenario:
> >
> > CPU0:                           CPU1:
> >
> >                                 gather_surplus_pages()
> >                                   page = alloc_surplus_huge_page()
> > memory_failure_hugetlb()
> >   get_hwpoison_page(page)
> >     __get_hwpoison_page(page)
> >       get_page_unless_zero(page)
> >                                   zero = put_page_testzero(page)
> >                                   VM_BUG_ON_PAGE(!zero, page)
> >                                   enqueue_huge_page(h, page)
> >   put_page(page)
> >
> > The refcount can possibly be increased by memory-failure or soft_offline
> > handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > hugetlb pool list.
>
> The hwpoison side of this looks really suspicious to me. It shouldn't
> really touch the reference count of hugetlb pages without being very
> careful (and having hugetlb_lock held). What would happen if the
> reference count was increased after the page has been enqueued into the
> pool? This can just blow up later.

If the page has been enqueued into the pool, it can be allocated to
another user, and dequeue_huge_page_node_exact() resets its reference
count to 1. The put_page() in memory-failure then frees the page, which
is wrong because another user still holds it.
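
The sequence described above can be sketched in plain C as a single-threaded
replay. This is an illustration under assumptions, not kernel code:
struct fake_hpage and stale_put_frees_in_use_page() are hypothetical names,
and the "reset to 1" step only mirrors what this mail says about
dequeue_huge_page_node_exact().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a hugetlb page: just the fields the scenario needs. */
struct fake_hpage {
	int refcount;
	bool on_free_list;
	bool has_user;
};

/*
 * Returns true if the stale put_page() from the memory-failure side drops
 * the refcount to zero while a new user still owns the page.
 */
bool stale_put_frees_in_use_page(void)
{
	/* The page was enqueued while the handler still held its reference. */
	struct fake_hpage page = {
		.refcount = 1,
		.on_free_list = true,
		.has_user = false,
	};

	/* Another task dequeues the page; the count is reset to 1. */
	page.on_free_list = false;
	page.refcount = 1;
	page.has_user = true;

	/* The handler drops the reference it took before the enqueue. */
	page.refcount--;

	return page.refcount == 0 && page.has_user;
}
```

The handler's reference is silently overwritten by the reset, so its
put_page() ends up dropping the new user's reference.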

>
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/hugetlb.c | 11 ++++-------
> >  1 file changed, 4 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3476aa06da70..6c96332db34b 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2145,17 +2145,14 @@ static int gather_surplus_pages(struct hstate *h, long delta)
> >
> >       /* Free the needed pages to the hugetlb pool */
> >       list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> > -             int zeroed;
> > -
> >               if ((--needed) < 0)
> >                       break;
> >               /*
> > -              * This page is now managed by the hugetlb allocator and has
> > -              * no users -- drop the buddy allocator's reference.
> > +              * The refcount can possibly be increased by memory-failure or
> > +              * soft_offline handlers.
> >                */
> > -             zeroed = put_page_testzero(page);
> > -             VM_BUG_ON_PAGE(!zeroed, page);
> > -             enqueue_huge_page(h, page);
> > +             if (likely(put_page_testzero(page)))
> > +                     enqueue_huge_page(h, page);
> >       }
> >  free:
> >       spin_unlock_irq(&hugetlb_lock);
> > --
> > 2.11.0
> >
>
> --
> Michal Hocko
> SUSE Labs


* Re: [External] Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-21  8:15   ` [External] " Muchun Song
@ 2021-04-21  8:21     ` Oscar Salvador
  2021-04-21  8:41       ` Muchun Song
  2021-04-21  8:43       ` Michal Hocko
  2021-04-21  8:25     ` Michal Hocko
  1 sibling, 2 replies; 20+ messages in thread
From: Oscar Salvador @ 2021-04-21  8:21 UTC (permalink / raw)
  To: Muchun Song
  Cc: Michal Hocko, Mike Kravetz, Andrew Morton,
	Linux Memory Management List, LKML, Naoya Horiguchi

On Wed, Apr 21, 2021 at 04:15:00PM +0800, Muchun Song wrote:
> > The hwpoison side of this looks really suspicious to me. It shouldn't
> > really touch the reference count of hugetlb pages without being very
> > careful (and having hugetlb_lock held). What would happen if the
> > reference count was increased after the page has been enqueued into the
> > pool? This can just blow up later.
> 
> If the page has been enqueued into the pool, then the page can be
> allocated to other users. The page reference count will be reset to
> 1 in the dequeue_huge_page_node_exact(). Then memory-failure
> will free the page because of put_page(). This is wrong. Because
> there is another user.

Note that dequeue_huge_page_node_exact() will not hand over any pages
which are poisoned, so in this case it will not be allocated.
But it is true that we might need the hugetlb lock; this needs some more
thought.

I will have a look. 
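
A small sketch of the behaviour noted above, where the dequeue path skips
poisoned pages. It is a simplified userspace model: struct fake_hpage and
fake_dequeue() are hypothetical, the free list is a plain array, and only the
poison check is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a free hugetlb page: only the poison flag matters. */
struct fake_hpage {
	bool hwpoison;
};

/*
 * Walks the free list and returns the first page that is not poisoned,
 * or NULL if every page on the list is poisoned.
 */
struct fake_hpage *fake_dequeue(struct fake_hpage *free_list, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (free_list[i].hwpoison)
			continue;	/* never hand out a poisoned page */
		return &free_list[i];
	}
	return NULL;
}
```

The check only helps once the poison flag is set, so the result depends on
when the handler marks the page.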

-- 
Oscar Salvador
SUSE L3


* Re: [External] Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-21  8:15   ` [External] " Muchun Song
  2021-04-21  8:21     ` Oscar Salvador
@ 2021-04-21  8:25     ` Michal Hocko
  1 sibling, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2021-04-21  8:25 UTC (permalink / raw)
  To: Muchun Song
  Cc: Mike Kravetz, Andrew Morton, Oscar Salvador,
	Linux Memory Management List, LKML, Naoya Horiguchi

On Wed 21-04-21 16:15:00, Muchun Song wrote:
> On Wed, Apr 21, 2021 at 4:03 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [Cc Naoya]
> >
> > On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > > The possible bad scenario:
> > >
> > > CPU0:                           CPU1:
> > >
> > >                                 gather_surplus_pages()
> > >                                   page = alloc_surplus_huge_page()
> > > memory_failure_hugetlb()
> > >   get_hwpoison_page(page)
> > >     __get_hwpoison_page(page)
> > >       get_page_unless_zero(page)
> > >                                   zero = put_page_testzero(page)
> > >                                   VM_BUG_ON_PAGE(!zero, page)
> > >                                   enqueue_huge_page(h, page)
> > >   put_page(page)
> > >
> > > The refcount can possibly be increased by memory-failure or soft_offline
> > > handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > > hugetlb pool list.
> >
> > The hwpoison side of this looks really suspicious to me. It shouldn't
> > really touch the reference count of hugetlb pages without being very
> > careful (and having hugetlb_lock held). What would happen if the
> > > reference count was increased after the page has been enqueued into the
> > pool? This can just blow up later.
> 
> If the page has been enqueued into the pool, then the page can be
> allocated to other users. The page reference count will be reset to
> 1 in the dequeue_huge_page_node_exact(). Then memory-failure
> will free the page because of put_page(). This is wrong. Because
> there is another user.

Yes, that is one of the scenarios, but I suspect there are more lurking
there. My point was that this should be addressed on the hwpoison
side.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-21  8:03 ` Michal Hocko
  2021-04-21  8:15   ` [External] " Muchun Song
@ 2021-04-21  8:33   ` HORIGUCHI NAOYA(堀口　直也)
  2021-04-21  9:02     ` [External] " Muchun Song
  2021-04-21 18:03     ` Mike Kravetz
  1 sibling, 2 replies; 20+ messages in thread
From: HORIGUCHI NAOYA(堀口　直也) @ 2021-04-21  8:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Muchun Song, mike.kravetz, akpm, osalvador, linux-mm,
	linux-kernel, Naoya Horiguchi

On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
> [Cc Naoya]
> 
> On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > The possible bad scenario:
> > 
> > CPU0:                           CPU1:
> > 
> >                                 gather_surplus_pages()
> >                                   page = alloc_surplus_huge_page()
> > memory_failure_hugetlb()
> >   get_hwpoison_page(page)
> >     __get_hwpoison_page(page)
> >       get_page_unless_zero(page)
> >                                   zero = put_page_testzero(page)
> >                                   VM_BUG_ON_PAGE(!zero, page)
> >                                   enqueue_huge_page(h, page)
> >   put_page(page)
> > 
> > The refcount can possibly be increased by memory-failure or soft_offline
> > handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > hugetlb pool list.
> 
> The hwpoison side of this looks really suspicious to me. It shouldn't
> really touch the reference count of hugetlb pages without being very
> careful (and having hugetlb_lock held).

I have the same feeling. There is a window where a hugepage holds a
refcount while it is being converted from buddy free pages into a free
hugepage, so the refcount alone is not enough to prevent the race.
hugetlb_lock is retaken after alloc_surplus_huge_page() returns, so simply
holding hugetlb_lock in get_hwpoison_page() does not seem to work. Is there
any status bit to show that a hugepage is just being initialized (neither in
the free hugepage pool nor in use)?

> What would happen if the
> reference count was increased after the page has been enqueued into the
> pool? This can just blow up later.

Yes, this is another concern.

Thanks,
Naoya Horiguchi


* Re: [External] Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-21  8:21     ` Oscar Salvador
@ 2021-04-21  8:41       ` Muchun Song
  2021-04-21  8:49         ` Oscar Salvador
  2021-04-21  8:43       ` Michal Hocko
  1 sibling, 1 reply; 20+ messages in thread
From: Muchun Song @ 2021-04-21  8:41 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Michal Hocko, Mike Kravetz, Andrew Morton,
	Linux Memory Management List, LKML, Naoya Horiguchi

On Wed, Apr 21, 2021 at 4:21 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Wed, Apr 21, 2021 at 04:15:00PM +0800, Muchun Song wrote:
> > > The hwpoison side of this looks really suspicious to me. It shouldn't
> > > really touch the reference count of hugetlb pages without being very
> > > careful (and having hugetlb_lock held). What would happen if the
> > > reference count was increased after the page has been enqueued into the
> > > pool? This can just blow up later.
> >
> > If the page has been enqueued into the pool, then the page can be
> > allocated to other users. The page reference count will be reset to
> > 1 in the dequeue_huge_page_node_exact(). Then memory-failure
> > will free the page because of put_page(). This is wrong. Because
> > there is another user.
>
> Note that dequeue_huge_page_node_exact() will not hand over any pages
> which are poisoned, so in this case it will not be allocated.

But soft_offline does not set the page hwpoison before
__get_hwpoison_page(), so the page can still be
allocated. Right?

> But it is true that we might need hugetlb lock, this needs some more
> thought.
>
> I will have a look.
>
> --
> Oscar Salvador
> SUSE L3


* Re: [External] Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-21  8:21     ` Oscar Salvador
  2021-04-21  8:41       ` Muchun Song
@ 2021-04-21  8:43       ` Michal Hocko
  1 sibling, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2021-04-21  8:43 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Muchun Song, Mike Kravetz, Andrew Morton,
	Linux Memory Management List, LKML, Naoya Horiguchi

On Wed 21-04-21 10:21:03, Oscar Salvador wrote:
> On Wed, Apr 21, 2021 at 04:15:00PM +0800, Muchun Song wrote:
> > > The hwpoison side of this looks really suspicious to me. It shouldn't
> > > really touch the reference count of hugetlb pages without being very
> > > careful (and having hugetlb_lock held). What would happen if the
> > > reference count was increased after the page has been enqueued into the
> > > pool? This can just blow up later.
> > 
> > If the page has been enqueued into the pool, then the page can be
> > allocated to other users. The page reference count will be reset to
> > 1 in the dequeue_huge_page_node_exact(). Then memory-failure
> > will free the page because of put_page(). This is wrong. Because
> > there is another user.
> 
> Note that dequeue_huge_page_node_exact() will not hand over any pages
> which are poisoned, so in this case it will not be allocated.

I have to say I missed the HWPoison check, so this particular
scenario is indeed not possible.

> But it is true that we might need hugetlb lock, this needs some more
> thought.

Yes, nobody should be touching the reference count of hugetlb pool
pages outside of hugetlb proper.

> I will have a look. 

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [External] Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-21  8:41       ` Muchun Song
@ 2021-04-21  8:49         ` Oscar Salvador
  2021-04-21  8:58           ` Muchun Song
  0 siblings, 1 reply; 20+ messages in thread
From: Oscar Salvador @ 2021-04-21  8:49 UTC (permalink / raw)
  To: Muchun Song
  Cc: Michal Hocko, Mike Kravetz, Andrew Morton,
	Linux Memory Management List, LKML, Naoya Horiguchi

On Wed, Apr 21, 2021 at 04:41:10PM +0800, Muchun Song wrote:
 
> But softoffline does not set page hwpoison before
> __get_hwpoison_page(). So the page still can be
> allocated. Right?

Yep, soft_offline() only marks the page as hwpoison once the page has been
fully contained and no other use is possible.
But yeah, hugetlb is a bit trickier in that regard.

This needs fixing in there.


-- 
Oscar Salvador
SUSE L3


* Re: [External] Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-21  8:49         ` Oscar Salvador
@ 2021-04-21  8:58           ` Muchun Song
  0 siblings, 0 replies; 20+ messages in thread
From: Muchun Song @ 2021-04-21  8:58 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Michal Hocko, Mike Kravetz, Andrew Morton,
	Linux Memory Management List, LKML, Naoya Horiguchi

On Wed, Apr 21, 2021 at 4:49 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Wed, Apr 21, 2021 at 04:41:10PM +0800, Muchun Song wrote:
>
> > But softoffline does not set page hwpoison before
> > __get_hwpoison_page(). So the page still can be
> > allocated. Right?
>
> Yep, soft_offline() only marks the page as hwpoison once the page has been
> fully contended and no other use is possible.
> But yeah, hugetlb is a bit trickier in that regard.
>
> This needs fixing in there.

It is OK to fix it in softoffline/memory-failure.
I just want to expose the race. Thanks.

>
>
> --
> Oscar Salvador
> SUSE L3


* Re: [External] Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-21  8:33   ` HORIGUCHI NAOYA(堀口　直也)
@ 2021-04-21  9:02     ` Muchun Song
  2021-04-21 18:03     ` Mike Kravetz
  1 sibling, 0 replies; 20+ messages in thread
From: Muchun Song @ 2021-04-21  9:02 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Michal Hocko, mike.kravetz, akpm, osalvador, linux-mm,
	linux-kernel, Naoya Horiguchi

On Wed, Apr 21, 2021 at 4:33 PM HORIGUCHI NAOYA(堀口　直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
> > [Cc Naoya]
> >
> > On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > > The possible bad scenario:
> > >
> > > CPU0:                           CPU1:
> > >
> > >                                 gather_surplus_pages()
> > >                                   page = alloc_surplus_huge_page()
> > > memory_failure_hugetlb()
> > >   get_hwpoison_page(page)
> > >     __get_hwpoison_page(page)
> > >       get_page_unless_zero(page)
> > >                                   zero = put_page_testzero(page)
> > >                                   VM_BUG_ON_PAGE(!zero, page)
> > >                                   enqueue_huge_page(h, page)
> > >   put_page(page)
> > >
> > > The refcount can possibly be increased by memory-failure or soft_offline
> > > handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > > hugetlb pool list.
> >
> > The hwpoison side of this looks really suspicious to me. It shouldn't
> > really touch the reference count of hugetlb pages without being very
> > careful (and having hugetlb_lock held).
>
> I have the same feeling, there is a window where a hugepage is refcounted
> during converting from buddy free pages into free hugepage, so refcount
> alone is not enough to prevent the race.  hugetlb_lock is retaken after
> alloc_surplus_huge_page returns, so simply holding hugetlb_lock in
> get_hwpoison_page() seems not work.  Is there any status bit to show that a
> hugepage is just being initialized (not in free hugepage pool or in use)?

HPageFreed() can indicate whether a page is on the
free pool list.
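
As a rough illustration of how such a status bit could gate the reference
grab, here is a userspace sketch. It is not the fix proposed in this thread:
struct fake_hpage, its freed/migratable fields and
fake_try_get_hwpoison_ref() are hypothetical, the return convention is made
up for the example, and locking (the thread suggests hugetlb_lock would
matter) is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model: two status bits plus a plain refcount. */
struct fake_hpage {
	int refcount;
	bool freed;		/* on the free pool list */
	bool migratable;	/* in use */
};

/*
 * Returns 1 if a reference was taken, 0 if the page is free and needs no
 * reference, and -EBUSY if the page is in a transient state (neither on
 * the free list nor in use) and must not be touched.
 */
int fake_try_get_hwpoison_ref(struct fake_hpage *p)
{
	if (!p->freed && !p->migratable)
		return -EBUSY;	/* still being set up or torn down */
	if (p->refcount == 0)
		return 0;	/* free page: nothing to pin */
	p->refcount++;
	return 1;
}
```

A caller seeing -EBUSY leaves the refcount untouched, so the allocation path
can still observe the count it expects.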

>
> > What would happen if the
> > reference count was increased after the page has been enqueued into the
> > pool? This can just blow up later.
>
> Yes, this is another concern.
>
> Thanks,
> Naoya Horiguchi


* Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-21  8:33   ` HORIGUCHI NAOYA(堀口　直也)
  2021-04-21  9:02     ` [External] " Muchun Song
@ 2021-04-21 18:03     ` Mike Kravetz
  2021-04-22  8:27       ` HORIGUCHI NAOYA(堀口　直也)
  1 sibling, 1 reply; 20+ messages in thread
From: Mike Kravetz @ 2021-04-21 18:03 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也), Michal Hocko
  Cc: Muchun Song, akpm, osalvador, linux-mm, linux-kernel, Naoya Horiguchi

On 4/21/21 1:33 AM, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
>> [Cc Naoya]
>>
>> On Wed 21-04-21 14:02:59, Muchun Song wrote:
>>> The possible bad scenario:
>>>
>>> CPU0:                           CPU1:
>>>
>>>                                 gather_surplus_pages()
>>>                                   page = alloc_surplus_huge_page()
>>> memory_failure_hugetlb()
>>>   get_hwpoison_page(page)
>>>     __get_hwpoison_page(page)
>>>       get_page_unless_zero(page)
>>>                                   zero = put_page_testzero(page)
>>>                                   VM_BUG_ON_PAGE(!zero, page)
>>>                                   enqueue_huge_page(h, page)
>>>   put_page(page)
>>>
>>> The refcount can possibly be increased by memory-failure or soft_offline
>>> handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
>>> hugetlb pool list.
>>
>> The hwpoison side of this looks really suspicious to me. It shouldn't
>> really touch the reference count of hugetlb pages without being very
>> careful (and having hugetlb_lock held).
> 
> I have the same feeling, there is a window where a hugepage is refcounted
> during converting from buddy free pages into free hugepage, so refcount
> alone is not enough to prevent the race.  hugetlb_lock is retaken after
> alloc_surplus_huge_page returns, so simply holding hugetlb_lock in
> get_hwpoison_page() seems not work.  Is there any status bit to show that a
> hugepage is just being initialized (not in free hugepage pool or in use)?
> 

It seems we can also race with the code that makes a compound page a
hugetlb page.  The memory failure code could be called after allocating
pages from buddy and before setting the compound page DTOR.  So, the memory
failure handling code will process it as a plain compound page.

Just thinking that this may not be limited to the hugetlb specific memory
failure handling?
-- 
Mike Kravetz


* Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-21 18:03     ` Mike Kravetz
@ 2021-04-22  8:27       ` HORIGUCHI NAOYA(堀口　直也)
  2021-04-23  8:01         ` HORIGUCHI NAOYA(堀口　直也)
  0 siblings, 1 reply; 20+ messages in thread
From: HORIGUCHI NAOYA(堀口　直也) @ 2021-04-22  8:27 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Michal Hocko, Muchun Song, akpm, osalvador, linux-mm,
	linux-kernel, Naoya Horiguchi

On Wed, Apr 21, 2021 at 11:03:24AM -0700, Mike Kravetz wrote:
> On 4/21/21 1:33 AM, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
> >> [Cc Naoya]
> >>
> >> On Wed 21-04-21 14:02:59, Muchun Song wrote:
> >>> The possible bad scenario:
> >>>
> >>> CPU0:                           CPU1:
> >>>
> >>>                                 gather_surplus_pages()
> >>>                                   page = alloc_surplus_huge_page()
> >>> memory_failure_hugetlb()
> >>>   get_hwpoison_page(page)
> >>>     __get_hwpoison_page(page)
> >>>       get_page_unless_zero(page)
> >>>                                   zero = put_page_testzero(page)
> >>>                                   VM_BUG_ON_PAGE(!zero, page)
> >>>                                   enqueue_huge_page(h, page)
> >>>   put_page(page)
> >>>
> >>> The refcount can possibly be increased by memory-failure or soft_offline
> >>> handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> >>> hugetlb pool list.
> >>
> >> The hwpoison side of this looks really suspicious to me. It shouldn't
> >> really touch the reference count of hugetlb pages without being very
> >> careful (and having hugetlb_lock held).
> > 
> > I have the same feeling, there is a window where a hugepage is refcounted
> > during converting from buddy free pages into free hugepage, so refcount
> > alone is not enough to prevent the race.  hugetlb_lock is retaken after
> > alloc_surplus_huge_page returns, so simply holding hugetlb_lock in
> > get_hwpoison_page() seems not work.  Is there any status bit to show that a
> > hugepage is just being initialized (not in free hugepage pool or in use)?
> > 
> 
> It seems we can also race with the code that makes a compound page a
> hugetlb page.  The memory failure code could be called after allocating
> pages from buddy and before setting compound page DTOR.  So, the memory
> handling code will process it as a compound page.

Yes, so get_hwpoison_page() has to call get_page_unless_zero()
only when memory_failure() can surely handle the error.

> 
> Just thinking that this may not be limited to the hugetlb specific memory
> failure handling?

Currently hugetlb page is the only type of compound page supported by memory
failure.  But I agree with you that other types of compound pages have the
same race window, and judging only with get_page_unless_zero() is dangerous.
So I think that __get_hwpoison_page() should have the following structure:

  if (PageCompound) {
      if (PageHuge) {
          if (PageHugeFreed || PageHugeActive) {
              if (get_page_unless_zero)
                  return 0;   // path for in-use hugetlb page
              else
                  return 1;   // path for free hugetlb page
          } else {
              return -EBUSY;  // any transient hugetlb page
          }
      } else {
          ... // any other compound page (like thp, slab, ...)
      }
  } else {
      ...   // any non-compound page
  }

Thanks,
Naoya Horiguchi


* Re: [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages
  2021-04-22  8:27       ` HORIGUCHI NAOYA(堀口　直也)
@ 2021-04-23  8:01         ` HORIGUCHI NAOYA(堀口　直也)
  2021-04-28  7:46           ` [PATCH] mm,hwpoison: fix race with compound page allocation Naoya Horiguchi
  0 siblings, 1 reply; 20+ messages in thread
From: HORIGUCHI NAOYA(堀口　直也) @ 2021-04-23  8:01 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Michal Hocko, Muchun Song, akpm, osalvador, linux-mm,
	linux-kernel, Naoya Horiguchi

On Thu, Apr 22, 2021 at 08:27:46AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Apr 21, 2021 at 11:03:24AM -0700, Mike Kravetz wrote:
> > On 4/21/21 1:33 AM, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
> > >> [Cc Naoya]
> > >>
> > >> On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > >>> The possible bad scenario:
> > >>>
> > >>> CPU0:                           CPU1:
> > >>>
> > >>>                                 gather_surplus_pages()
> > >>>                                   page = alloc_surplus_huge_page()
> > >>> memory_failure_hugetlb()
> > >>>   get_hwpoison_page(page)
> > >>>     __get_hwpoison_page(page)
> > >>>       get_page_unless_zero(page)
> > >>>                                   zero = put_page_testzero(page)
> > >>>                                   VM_BUG_ON_PAGE(!zero, page)
> > >>>                                   enqueue_huge_page(h, page)
> > >>>   put_page(page)
> > >>>
> > >>> The refcount can possibly be increased by memory-failure or soft_offline
> > >>> handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > >>> hugetlb pool list.
> > >>
> > >> The hwpoison side of this looks really suspicious to me. It shouldn't
> > >> really touch the reference count of hugetlb pages without being very
> > >> careful (and having hugetlb_lock held).
> > > 
> > > I have the same feeling, there is a window where a hugepage is refcounted
> > > during converting from buddy free pages into free hugepage, so refcount
> > > alone is not enough to prevent the race.  hugetlb_lock is retaken after
> > > alloc_surplus_huge_page returns, so simply holding hugetlb_lock in
> > > get_hwpoison_page() seems not work.  Is there any status bit to show that a
> > > hugepage is just being initialized (not in free hugepage pool or in use)?
> > > 
> > 
> > It seems we can also race with the code that makes a compound page a
> > hugetlb page.  The memory failure code could be called after allocating
> > pages from buddy and before setting compound page DTOR.  So, the memory
> > handling code will process it as a compound page.
> 
> Yes, so get_hwpoison_page() has to call get_page_unless_zero()
> only when memory_failure() can surely handle the error.
> 
> > 
> > Just thinking that this may not be limited to the hugetlb specific memory
> > failure handling?
> 
> Currently hugetlb page is the only type of compound page supported by memory
> failure.  But I agree with you that other types of compound pages have the
> same race window, and judging only with get_page_unless_zero() is dangerous.
> So I think that __get_hwpoison_page() should have the following structure:
> 
>   if (PageCompound) {
>       if (PageHuge) {
>           if (PageHugeFreed || PageHugeActive) {
>               if (get_page_unless_zero)
>                   return 0;   // path for in-use hugetlb page
>               else
>                   return 1;   // path for free hugetlb page
>           } else {
>               return -EBUSY;  // any transient hugetlb page
>           }
>       } else {
>           ... // any other compound page (like thp, slab, ...)
>       }
>   } else {
>       ...   // any non-compound page
>   }

The pseudo code above was wrong, so let me update my proposal.
I'm now trying to solve the reported issue by changing __get_hwpoison_page()
as below:

  static int __get_hwpoison_page(struct page *page)
  {
          struct page *head = compound_head(page);
  
          if (PageCompound(page)) {
                  if (PageSlab(page)) {
                          return get_page_unless_zero(page);
                  } else if (PageHuge(head)) {
                          if (HPageFreed(head) || HPageMigratable(head))
                                  return get_page_unless_zero(head);
                  } else if (PageTransHuge(head)) {
                          /*
                           * Non anonymous thp exists only in allocation/free time. We
                           * can't handle such a case correctly, so let's give it up.
                           * This should be better than triggering BUG_ON when kernel
                           * tries to touch the "partially handled" page.
                           */
                          if (!PageAnon(head)) {
                                  pr_err("Memory failure: %#lx: non anonymous thp\n",
                                         page_to_pfn(page));
                                  return 0;
                          }
                          if (get_page_unless_zero(head)) {
                                  if (head == compound_head(page))
                                          return 1;
                                  pr_info("Memory failure: %#lx cannot catch tail\n",
                                          page_to_pfn(page));
                                  put_page(head);
                          }
                  }
                  return 0;
          }
  
          return get_page_unless_zero(page);
  }

Some notes:

  - in the hugetlb path, the new HPage* checks should avoid the reported race,
    but I still need more testing to confirm it,
  - the PageSlab check is added because I found that slab pages otherwise take
    the "non anonymous thp" path, which is obviously wrong,
  - the thp branch has a known issue unrelated to the current one, which
    should be improved later.
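
As a rough illustration of the hugetlb branch above, here is a user-space
sketch of "take a reference only when the page is in a known state".
Everything in it is an assumption made for illustration: struct model_page,
enum hpage_state and the model_* helpers are hypothetical stand-ins built on
C11 atomics, not kernel APIs, and the three states only approximate
HPageFreed(), HPageMigratable() and "neither".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state of a hugetlb head page. */
enum hpage_state {
	HPAGE_TRANSIENT,	/* being set up or torn down: neither flag set */
	HPAGE_FREED,		/* models HPageFreed(): sitting in the free pool */
	HPAGE_MIGRATABLE	/* models HPageMigratable(): in use */
};

struct model_page {
	atomic_int refcount;
	enum hpage_state state;
};

static inline void model_page_init(struct model_page *p, int refcount,
				   enum hpage_state state)
{
	assert(refcount >= 0);
	atomic_init(&p->refcount, refcount);
	p->state = state;
}

/* Same contract as get_page_unless_zero(): +1 unless the count is 0. */
static inline bool model_get_unless_zero(struct model_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded and the zero check repeats. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Gate the refcount grab on the page state, as in the hugetlb branch. */
static inline bool model_get_hwpoison_hugetlb(struct model_page *p)
{
	if (p->state != HPAGE_FREED && p->state != HPAGE_MIGRATABLE)
		return false;	/* transient page: leave its refcount alone */
	return model_get_unless_zero(p);
}
```

In this model a transient page with refcount 1 is left untouched, an in-use
page gains a reference, and a free page (refcount 0) is reported as "no
reference taken", which corresponds to the return-0 outcome of the hugetlb
branch.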

I'll send a patch next week.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH] mm,hwpoison: fix race with compound page allocation
  2021-04-23  8:01         ` HORIGUCHI NAOYA(堀口　直也)
@ 2021-04-28  7:46           ` Naoya Horiguchi
  2021-04-28  8:23             ` Oscar Salvador
  0 siblings, 1 reply; 20+ messages in thread
From: Naoya Horiguchi @ 2021-04-28  7:46 UTC (permalink / raw)
  To: Mike Kravetz, Michal Hocko, Muchun Song, akpm, osalvador, linux-mm
  Cc: linux-kernel, Naoya Horiguchi

On Fri, Apr 23, 2021 at 08:01:54AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Apr 22, 2021 at 08:27:46AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Apr 21, 2021 at 11:03:24AM -0700, Mike Kravetz wrote:
> > > On 4/21/21 1:33 AM, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > > On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
> > > >> [Cc Naoya]
> > > >>
> > > >> On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > > >>> The possible bad scenario:
> > > >>>
> > > >>> CPU0:                           CPU1:
> > > >>>
> > > >>>                                 gather_surplus_pages()
> > > >>>                                   page = alloc_surplus_huge_page()
> > > >>> memory_failure_hugetlb()
> > > >>>   get_hwpoison_page(page)
> > > >>>     __get_hwpoison_page(page)
> > > >>>       get_page_unless_zero(page)
> > > >>>                                   zero = put_page_testzero(page)
> > > >>>                                   VM_BUG_ON_PAGE(!zero, page)
> > > >>>                                   enqueue_huge_page(h, page)
> > > >>>   put_page(page)
> > > >>>
> > > >>> The refcount can possibly be increased by memory-failure or soft_offline
> > > >>> handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > > >>> hugetlb pool list.
> > > >>
> > > >> The hwpoison side of this looks really suspicious to me. It shouldn't
> > > >> really touch the reference count of hugetlb pages without being very
> > > >> careful (and having hugetlb_lock held).
> > > > 
> > > > I have the same feeling, there is a window where a hugepage is refcounted
> > > > during converting from buddy free pages into free hugepage, so refcount
> > > > alone is not enough to prevent the race.  hugetlb_lock is retaken after
> > > > alloc_surplus_huge_page returns, so simply holding hugetlb_lock in
> > > > get_hwpoison_page() seems not work.  Is there any status bit to show that a
> > > > hugepage is just being initialized (not in free hugepage pool or in use)?
> > > > 
> > > 
> > > It seems we can also race with the code that makes a compound page a
> > > hugetlb page.  The memory failure code could be called after allocating
> > > pages from buddy and before setting compound page DTOR.  So, the memory
> > > handling code will process it as a compound page.
> > 
> > Yes, so get_hwpoison_page() has to call get_page_unless_zero()
> > only when memory_failure() can surely handle the error.
> > 
> > > 
> > > Just thinking that this may not be limited to the hugetlb specific memory
> > > failure handling?
> > 
> > Currently hugetlb page is the only type of compound page supported by memory
> > failure.  But I agree with you that other types of compound pages have the
> > same race window, and judging only with get_page_unless_zero() is dangerous.
> > So I think that __get_hwpoison_page() should have the following structure:
> > 
> >   if (PageCompound) {
> >       if (PageHuge) {
> >           if (PageHugeFreed || PageHugeActive) {
> >               if (get_page_unless_zero)
> >                   return 0;   // path for in-use hugetlb page
> >               else
> >                   return 1;   // path for free hugetlb page
> >           } else {
> >               return -EBUSY;  // any transient hugetlb page
> >           }
> >       } else {
> >           ... // any other compound page (like thp, slab, ...)
> >       }
> >   } else {
> >       ...   // any non-compound page
> >   }
> 
> The above pseudo code was wrong, so let me update my thought.
> I'm now trying to solve the reported issue by changing __get_hwpoison_page()
> like below:
> 
>   static int __get_hwpoison_page(struct page *page)
>   {
>           struct page *head = compound_head(page);
>   
>           if (PageCompound(page)) {
>                   if (PageSlab(page)) {
>                           return get_page_unless_zero(page);
>                   } else if (PageHuge(head)) {
>                           if (HPageFreed(head) || HPageMigratable(head))
>                                   return get_page_unless_zero(head);
>                   } else if (PageTransHuge(head)) {
>                           /*
>                            * Non anonymous thp exists only in allocation/free time. We
>                            * can't handle such a case correctly, so let's give it up.
>                            * This should be better than triggering BUG_ON when kernel
>                            * tries to touch the "partially handled" page.
>                            */
>                           if (!PageAnon(head)) {
>                                   pr_err("Memory failure: %#lx: non anonymous thp\n",
>                                          page_to_pfn(page));
>                                   return 0;
>                           }
>                           if (get_page_unless_zero(head)) {
>                                   if (head == compound_head(page))
>                                           return 1;
>                                   pr_info("Memory failure: %#lx cannot catch tail\n",
>                                           page_to_pfn(page));
>                                   put_page(head);
>                           }
>                   }
>                   return 0;
>           }
>   
>           return get_page_unless_zero(page);
>   }
> 
> Some notes: 
> 
>   - in hugetlb path, new HPage* checks should avoid the reported race,
>     but I still need more testing to confirm it,
>   - PageSlab check is added because otherwise I found that "non anonymous thp"
>     path is chosen, that's obviously wrong,
>   - thp's branch has a known issue unrelated to the current issue, which
>     will/should be improved later.
> 
> I'll send a patch next week.

I confirmed that the patch fixes the reported problem (using a testcase that
triggers VM_BUG_ON_PAGE() without this patch).
So let me suggest this as a fix on the hwpoison side.
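
For readers who want to see why the extra reference trips the assertion, the
reported interleaving can be replayed in program order in user space.  This
is only a sketch under stated assumptions: struct model_page and the model_*
helpers are hypothetical, only the refcount is modelled, and the page is
assumed to come back from alloc_surplus_huge_page() holding a single
reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model: only the refcount is represented. */
struct model_page {
	atomic_int refcount;
};

/* Same contract as get_page_unless_zero(): +1 unless the count is 0. */
static inline bool model_get_unless_zero(struct model_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Same contract as put_page_testzero(): -1, true if the count hit 0. */
static inline bool model_put_testzero(struct model_page *p)
{
	assert(atomic_load(&p->refcount) > 0);
	return atomic_fetch_sub(&p->refcount, 1) == 1;
}

/*
 * Replay the steps from the report one after another and return the value
 * gather_surplus_pages() would see in its "zeroed" variable.
 */
static inline bool model_replay(bool hwpoison_races)
{
	struct model_page page;

	/* CPU1: alloc_surplus_huge_page() returns a page with one reference. */
	atomic_init(&page.refcount, 1);

	/* CPU0: __get_hwpoison_page() takes a reference in the window. */
	if (hwpoison_races)
		(void)model_get_unless_zero(&page);

	/* CPU1: put_page_testzero(); false here is the VM_BUG_ON_PAGE() case. */
	return model_put_testzero(&page);
}
```

model_replay(false) reports that the count dropped to zero, while
model_replay(true) does not, which is the condition VM_BUG_ON_PAGE(!zero, page)
checks for.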

Thanks,
Naoya Horiguchi

---
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Wed, 28 Apr 2021 15:55:47 +0900
Subject: [PATCH] mm,hwpoison: fix race with compound page allocation

When a hugetlb page fault (under an overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:

    CPU0:                           CPU1:

                                    gather_surplus_pages()
                                      page = alloc_surplus_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
          get_page_unless_zero(page)
                                      zero = put_page_testzero(page)
                                      VM_BUG_ON_PAGE(!zero, page)
                                      enqueue_huge_page(h, page)
      put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is wrong because there are
time windows where compound pages have a non-zero refcount during
initialization.

So make __get_hwpoison_page() check more page status for a few types
of compound pages. The PageSlab() check is added because otherwise the
"non anonymous thp" path is wrongly chosen for slab pages.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memory-failure.c | 48 +++++++++++++++++++++++++--------------------
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a3659619d293..61988e332712 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1095,30 +1095,36 @@ static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
 
-	if (!PageHuge(head) && PageTransHuge(head)) {
-		/*
-		 * Non anonymous thp exists only in allocation/free time. We
-		 * can't handle such a case correctly, so let's give it up.
-		 * This should be better than triggering BUG_ON when kernel
-		 * tries to touch the "partially handled" page.
-		 */
-		if (!PageAnon(head)) {
-			pr_err("Memory failure: %#lx: non anonymous thp\n",
-				page_to_pfn(page));
-			return 0;
+	if (PageCompound(page)) {
+		if (PageSlab(page)) {
+			return get_page_unless_zero(page);
+		} else if (PageHuge(head)) {
+			if (HPageFreed(head) || HPageMigratable(head))
+				return get_page_unless_zero(head);
+		} else if (PageTransHuge(head)) {
+			/*
+			 * Non anonymous thp exists only in allocation/free time. We
+			 * can't handle such a case correctly, so let's give it up.
+			 * This should be better than triggering BUG_ON when kernel
+			 * tries to touch the "partially handled" page.
+			 */
+			if (!PageAnon(head)) {
+				pr_err("Memory failure: %#lx: non anonymous thp\n",
+				       page_to_pfn(page));
+				return 0;
+			}
+			if (get_page_unless_zero(head)) {
+				if (head == compound_head(page))
+					return 1;
+				pr_info("Memory failure: %#lx cannot catch tail\n",
+					page_to_pfn(page));
+				put_page(head);
+			}
 		}
+		return 0;
 	}
 
-	if (get_page_unless_zero(head)) {
-		if (head == compound_head(page))
-			return 1;
-
-		pr_info("Memory failure: %#lx cannot catch tail\n",
-			page_to_pfn(page));
-		put_page(head);
-	}
-
-	return 0;
+	return get_page_unless_zero(page);
 }
 
 /*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm,hwpoison: fix race with compound page allocation
  2021-04-28  7:46           ` [PATCH] mm,hwpoison: fix race with compound page allocation Naoya Horiguchi
@ 2021-04-28  8:23             ` Oscar Salvador
  2021-04-28  9:18               ` HORIGUCHI NAOYA(堀口　直也)
  0 siblings, 1 reply; 20+ messages in thread
From: Oscar Salvador @ 2021-04-28  8:23 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Mike Kravetz, Michal Hocko, Muchun Song, akpm, linux-mm,
	linux-kernel, Naoya Horiguchi

On Wed, Apr 28, 2021 at 04:46:54PM +0900, Naoya Horiguchi wrote:
> ---
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 28 Apr 2021 15:55:47 +0900
> Subject: [PATCH] mm,hwpoison: fix race with compound page allocation
> 
> When hugetlb page fault (under overcommiting situation) and memory_failure()
> race, VM_BUG_ON_PAGE() is triggered by the following race:
> 
>     CPU0:                           CPU1:
> 
>                                     gather_surplus_pages()
>                                       page = alloc_surplus_huge_page()
>     memory_failure_hugetlb()
>       get_hwpoison_page(page)
>         __get_hwpoison_page(page)
>           get_page_unless_zero(page)
>                                       zero = put_page_testzero(page)
>                                       VM_BUG_ON_PAGE(!zero, page)
>                                       enqueue_huge_page(h, page)
>       put_page(page)
> 
> __get_hwpoison_page() only checks page refcount before taking additional
> one for memory error handling, which is wrong because there's time
> windows where compound pages have non-zero refcount during initialization.
> 
> So makes __get_hwpoison_page() check more page status for a few types
> of compound pages. PageSlab() check is added because otherwise
> "non anonymous thp" path is wrongly chosen for slab pages.

Was it wrongly chosen even before? If so, maybe a Fixes tag is warranted.

> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Reported-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/memory-failure.c | 48 +++++++++++++++++++++++++--------------------
>  1 file changed, 27 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index a3659619d293..61988e332712 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1095,30 +1095,36 @@ static int __get_hwpoison_page(struct page *page)

> +	if (PageCompound(page)) {
> +		if (PageSlab(page)) {
> +			return get_page_unless_zero(page);
> +		} else if (PageHuge(head)) {
> +			if (HPageFreed(head) || HPageMigratable(head))
> +				return get_page_unless_zero(head);

Concerns were raised that memory-failure should not be fiddling with a page's
refcount without holding hugetlb_lock.
So, if we really want to make this more stable, we might want to hold the lock
here.

The clearing and setting of HPageFreed happens under the lock, and for HPageMigratable
that is also true for the clearing part, so I think it would be more sane to do
this under the lock to close any possible race.

Does it make sense?

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm,hwpoison: fix race with compound page allocation
  2021-04-28  8:23             ` Oscar Salvador
@ 2021-04-28  9:18               ` HORIGUCHI NAOYA(堀口　直也)
  2021-05-06  1:31                 ` [PATCH v2] " Naoya Horiguchi
  0 siblings, 1 reply; 20+ messages in thread
From: HORIGUCHI NAOYA(堀口　直也) @ 2021-04-28  9:18 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Naoya Horiguchi, Mike Kravetz, Michal Hocko, Muchun Song, akpm,
	linux-mm, linux-kernel

On Wed, Apr 28, 2021 at 10:23:49AM +0200, Oscar Salvador wrote:
> On Wed, Apr 28, 2021 at 04:46:54PM +0900, Naoya Horiguchi wrote:
> > ---
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Date: Wed, 28 Apr 2021 15:55:47 +0900
> > Subject: [PATCH] mm,hwpoison: fix race with compound page allocation
> > 
> > When hugetlb page fault (under overcommiting situation) and memory_failure()
> > race, VM_BUG_ON_PAGE() is triggered by the following race:
> > 
> >     CPU0:                           CPU1:
> > 
> >                                     gather_surplus_pages()
> >                                       page = alloc_surplus_huge_page()
> >     memory_failure_hugetlb()
> >       get_hwpoison_page(page)
> >         __get_hwpoison_page(page)
> >           get_page_unless_zero(page)
> >                                       zero = put_page_testzero(page)
> >                                       VM_BUG_ON_PAGE(!zero, page)
> >                                       enqueue_huge_page(h, page)
> >       put_page(page)
> > 
> > __get_hwpoison_page() only checks page refcount before taking additional
> > one for memory error handling, which is wrong because there's time
> > windows where compound pages have non-zero refcount during initialization.
> > 
> > So makes __get_hwpoison_page() check more page status for a few types
> > of compound pages. PageSlab() check is added because otherwise
> > "non anonymous thp" path is wrongly chosen for slab pages.
> 
> Was it wrongly chosen even before? If so, maybe a Fix tag is warranted.

OK, I'll check when this was introduced.

> 
> > 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Reported-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/memory-failure.c | 48 +++++++++++++++++++++++++--------------------
> >  1 file changed, 27 insertions(+), 21 deletions(-)
> > 
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index a3659619d293..61988e332712 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1095,30 +1095,36 @@ static int __get_hwpoison_page(struct page *page)
> 
> > +	if (PageCompound(page)) {
> > +		if (PageSlab(page)) {
> > +			return get_page_unless_zero(page);
> > +		} else if (PageHuge(head)) {
> > +			if (HPageFreed(head) || HPageMigratable(head))
> > +				return get_page_unless_zero(head);
> 
> There were concerns raised wrt. memory-failure should not be fiddling with page's
> refcount without holding a hugetlb lock.
> So, if we really want to make this more stable, we might want to hold the lock
> here.
> 
> The clearing and setting of HPageFreed happens under the lock, and for HPageMigratable
> that is also true for the clearing part, so I think it would be more sane to do
> this under the lock to close any possible race.
> 
> Does it make sense?

Thanks, I'll update to do the check under hugetlb_lock.

- Naoya Horiguchi

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2] mm,hwpoison: fix race with compound page allocation
  2021-04-28  9:18               ` HORIGUCHI NAOYA(堀口　直也)
@ 2021-05-06  1:31                 ` Naoya Horiguchi
  2021-05-06  8:51                   ` Oscar Salvador
  0 siblings, 1 reply; 20+ messages in thread
From: Naoya Horiguchi @ 2021-05-06  1:31 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, Michal Hocko, Muchun Song, akpm,
	HORIGUCHI NAOYA(堀口　直也),
	linux-mm, linux-kernel

On Wed, Apr 28, 2021 at 09:18:36AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Apr 28, 2021 at 10:23:49AM +0200, Oscar Salvador wrote:
> > On Wed, Apr 28, 2021 at 04:46:54PM +0900, Naoya Horiguchi wrote:
> > > ---
> > > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > > Date: Wed, 28 Apr 2021 15:55:47 +0900
> > > Subject: [PATCH] mm,hwpoison: fix race with compound page allocation
> > > 
> > > When hugetlb page fault (under overcommiting situation) and memory_failure()
> > > race, VM_BUG_ON_PAGE() is triggered by the following race:
> > > 
> > >     CPU0:                           CPU1:
> > > 
> > >                                     gather_surplus_pages()
> > >                                       page = alloc_surplus_huge_page()
> > >     memory_failure_hugetlb()
> > >       get_hwpoison_page(page)
> > >         __get_hwpoison_page(page)
> > >           get_page_unless_zero(page)
> > >                                       zero = put_page_testzero(page)
> > >                                       VM_BUG_ON_PAGE(!zero, page)
> > >                                       enqueue_huge_page(h, page)
> > >       put_page(page)
> > > 
> > > __get_hwpoison_page() only checks page refcount before taking additional
> > > one for memory error handling, which is wrong because there's time
> > > windows where compound pages have non-zero refcount during initialization.
> > > 
> > > So makes __get_hwpoison_page() check more page status for a few types
> > > of compound pages. PageSlab() check is added because otherwise
> > > "non anonymous thp" path is wrongly chosen for slab pages.
> > 
> > Was it wrongly chosen even before? If so, maybe a Fix tag is warranted.
> 
> OK, I'll check when this was introduced.
> 
> > 
> > > 
> > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > > Reported-by: Muchun Song <songmuchun@bytedance.com>
> > > ---
> > >  mm/memory-failure.c | 48 +++++++++++++++++++++++++--------------------
> > >  1 file changed, 27 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > index a3659619d293..61988e332712 100644
> > > --- a/mm/memory-failure.c
> > > +++ b/mm/memory-failure.c
> > > @@ -1095,30 +1095,36 @@ static int __get_hwpoison_page(struct page *page)
> > 
> > > +	if (PageCompound(page)) {
> > > +		if (PageSlab(page)) {
> > > +			return get_page_unless_zero(page);
> > > +		} else if (PageHuge(head)) {
> > > +			if (HPageFreed(head) || HPageMigratable(head))
> > > +				return get_page_unless_zero(head);
> > 
> > There were concerns raised wrt. memory-failure should not be fiddling with page's
> > refcount without holding a hugetlb lock.
> > So, if we really want to make this more stable, we might want to hold the lock
> > here.
> > 
> > The clearing and setting of HPageFreed happens under the lock, and for HPageMigratable
> > that is also true for the clearing part, so I think it would be more sane to do
> > this under the lock to close any possible race.
> > 
> > Does it make sense?
> 
> Thanks, I'll update to do the check under hugetlb_lock.

Hi,

Let me share the update below.  Two changes:
    - hold hugetlb_lock in the hugetlb path,
    - added a Fixes tag and Cc to stable. I limited the stable backport to
      5.12+ due to the dependency on HPage* pseudo flags.

- Naoya

---
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Thu, 6 May 2021 09:54:39 +0900
Subject: [PATCH] mm,hwpoison: fix race with compound page allocation

When a hugetlb page fault (under an overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:

    CPU0:                           CPU1:

                                    gather_surplus_pages()
                                      page = alloc_surplus_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
          get_page_unless_zero(page)
                                      zero = put_page_testzero(page)
                                      VM_BUG_ON_PAGE(!zero, page)
                                      enqueue_huge_page(h, page)
      put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is wrong because there are
time windows where compound pages have a non-zero refcount during
initialization.

So make __get_hwpoison_page() check more page status for a few types
of compound pages. The PageSlab() check is added because otherwise the
"non anonymous thp" path is wrongly chosen.

Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Cc: stable@vger.kernel.org # 5.12+
---
 mm/memory-failure.c | 53 +++++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a3659619d293..966a1d6b0bc8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1095,30 +1095,41 @@ static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
 
-	if (!PageHuge(head) && PageTransHuge(head)) {
-		/*
-		 * Non anonymous thp exists only in allocation/free time. We
-		 * can't handle such a case correctly, so let's give it up.
-		 * This should be better than triggering BUG_ON when kernel
-		 * tries to touch the "partially handled" page.
-		 */
-		if (!PageAnon(head)) {
-			pr_err("Memory failure: %#lx: non anonymous thp\n",
-				page_to_pfn(page));
-			return 0;
+	if (PageCompound(page)) {
+		if (PageSlab(page)) {
+			return get_page_unless_zero(page);
+		} else if (PageHuge(head)) {
+			int ret = 0;
+
+			spin_lock(&hugetlb_lock);
+			if (HPageFreed(head) || HPageMigratable(head))
+				ret = get_page_unless_zero(head);
+			spin_unlock(&hugetlb_lock);
+			return ret;
+		} else if (PageTransHuge(head)) {
+			/*
+			 * Non anonymous thp exists only in allocation/free time. We
+			 * can't handle such a case correctly, so let's give it up.
+			 * This should be better than triggering BUG_ON when kernel
+			 * tries to touch the "partially handled" page.
+			 */
+			if (!PageAnon(head)) {
+				pr_err("Memory failure: %#lx: non anonymous thp\n",
+				       page_to_pfn(page));
+				return 0;
+			}
+			if (get_page_unless_zero(head)) {
+				if (head == compound_head(page))
+					return 1;
+				pr_info("Memory failure: %#lx cannot catch tail\n",
+					page_to_pfn(page));
+				put_page(head);
+			}
 		}
+		return 0;
 	}
 
-	if (get_page_unless_zero(head)) {
-		if (head == compound_head(page))
-			return 1;
-
-		pr_info("Memory failure: %#lx cannot catch tail\n",
-			page_to_pfn(page));
-		put_page(head);
-	}
-
-	return 0;
+	return get_page_unless_zero(page);
 }
 
 /*
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2] mm,hwpoison: fix race with compound page allocation
  2021-05-06  1:31                 ` [PATCH v2] " Naoya Horiguchi
@ 2021-05-06  8:51                   ` Oscar Salvador
  2021-05-07  4:17                     ` HORIGUCHI NAOYA(堀口　直也)
  0 siblings, 1 reply; 20+ messages in thread
From: Oscar Salvador @ 2021-05-06  8:51 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Mike Kravetz, Michal Hocko, Muchun Song, akpm,
	HORIGUCHI NAOYA(堀口　直也),
	linux-mm, linux-kernel

On Thu, May 06, 2021 at 10:31:22AM +0900, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Thu, 6 May 2021 09:54:39 +0900
> Subject: [PATCH] mm,hwpoison: fix race with compound page allocation
> 
> When hugetlb page fault (under overcommiting situation) and memory_failure()
> race, VM_BUG_ON_PAGE() is triggered by the following race:
> 
>     CPU0:                           CPU1:
> 
>                                     gather_surplus_pages()
>                                       page = alloc_surplus_huge_page()
>     memory_failure_hugetlb()
>       get_hwpoison_page(page)
>         __get_hwpoison_page(page)
>           get_page_unless_zero(page)
>                                       zero = put_page_testzero(page)
>                                       VM_BUG_ON_PAGE(!zero, page)
>                                       enqueue_huge_page(h, page)
>       put_page(page)
> 
> __get_hwpoison_page() only checks page refcount before taking additional
> one for memory error handling, which is wrong because there's time
> windows where compound pages have non-zero refcount during initialization.
> 
> So makes __get_hwpoison_page() check more page status for a few types
> of compound pages. PageSlab() check is added because otherwise
> "non anonymous thp" path is wrongly chosen.
> 
> Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Reported-by: Muchun Song <songmuchun@bytedance.com>
> Cc: stable@vger.kernel.org # 5.12+

Hi Naoya, 

thanks for the patch.
I have some concerns though, more below:

> ---
>  mm/memory-failure.c | 53 +++++++++++++++++++++++++++------------------
>  1 file changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index a3659619d293..966a1d6b0bc8 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1095,30 +1095,41 @@ static int __get_hwpoison_page(struct page *page)
>  {
>  	struct page *head = compound_head(page);
>  
> -	if (!PageHuge(head) && PageTransHuge(head)) {
> -		/*
> -		 * Non anonymous thp exists only in allocation/free time. We
> -		 * can't handle such a case correctly, so let's give it up.
> -		 * This should be better than triggering BUG_ON when kernel
> -		 * tries to touch the "partially handled" page.
> -		 */
> -		if (!PageAnon(head)) {
> -			pr_err("Memory failure: %#lx: non anonymous thp\n",
> -				page_to_pfn(page));
> -			return 0;
> +	if (PageCompound(page)) {
> +		if (PageSlab(page)) {
> +			return get_page_unless_zero(page);
> +		} else if (PageHuge(head)) {
> +			int ret = 0;
> +
> +			spin_lock(&hugetlb_lock);
> +			if (HPageFreed(head) || HPageMigratable(head))
> +				ret = get_page_unless_zero(head);
> +			spin_unlock(&hugetlb_lock);
> +			return ret;

Ok, I am probably overthinking this but should we re-check under the
lock whether the page is a hugetlb page?
My concern is, what would happen if:

CPU0                                          CPU1
 __get_hwpoison_page                          
  PageHuge(head) == T                         
                                              dissolve hugetlb page
   hugetlb_lock                               


In that case, by the time we get to check hugetlb flags, those checks
might return false, and we do not get a refcount.
So, I guess my question is: Should we re-check under the lock, and if it
is no longer a hugetlb page, do a "goto try_to_get_ref" that starts right at
the beginning, or goes directly to the get_page_unless_zero at the end (the
former probably better)?

As I said, I might be overthinking this, but well.
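
To make the question concrete, a user-space sketch of the "re-check under the
lock" variant could look like the following.  All of it is assumed for
illustration: struct model_page, the atomic_flag spinlock standing in for
hugetlb_lock, and the race_hook parameter (a test-only device that runs in the
window between the unlocked check and taking the lock) are hypothetical and
not kernel code.  The sketch takes the second option, falling through to the
generic get_page_unless_zero at the end.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of the page state under discussion. */
struct model_page {
	atomic_int refcount;
	atomic_bool is_hugetlb;		/* stand-in for PageHuge(head) */
	bool freed_or_migratable;	/* HPageFreed() || HPageMigratable();
					 * only accessed under the lock */
};

/* Stand-in for hugetlb_lock: a C11 atomic_flag spinlock. */
static atomic_flag model_hugetlb_lock = ATOMIC_FLAG_INIT;

static inline void model_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&model_hugetlb_lock,
						 memory_order_acquire))
		;	/* spin */
}

static inline void model_unlock(void)
{
	atomic_flag_clear_explicit(&model_hugetlb_lock, memory_order_release);
}

/* Same contract as get_page_unless_zero(): +1 unless the count is 0. */
static inline bool model_get_unless_zero(struct model_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* CPU1 in the diagram: the page stops being a hugetlb page, under the lock. */
static inline void model_dissolve(struct model_page *p)
{
	model_lock();
	atomic_store(&p->is_hugetlb, false);
	p->freed_or_migratable = false;
	model_unlock();
}

/* race_hook may be NULL; it runs between the unlocked check and the lock. */
static inline bool model_get_hwpoison_page(struct model_page *p,
					   void (*race_hook)(struct model_page *))
{
	if (atomic_load(&p->is_hugetlb)) {
		bool still_hugetlb, ret = false;

		if (race_hook)
			race_hook(p);

		model_lock();
		still_hugetlb = atomic_load(&p->is_hugetlb);	/* re-check */
		if (still_hugetlb && p->freed_or_migratable)
			ret = model_get_unless_zero(p);
		model_unlock();

		if (still_hugetlb)
			return ret;
		/* Dissolved meanwhile: fall through to the generic path. */
	}
	return model_get_unless_zero(p);
}
```

Passing model_dissolve as race_hook replays the window from the diagram: the
hugetlb branch notices under the lock that the page changed type and hands it
to the generic path instead of returning 0 outright.  Whether the generic path
then succeeds depends on the refcount the page has after being dissolved,
which this model does not try to predict.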

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2] mm,hwpoison: fix race with compound page allocation
  2021-05-06  8:51                   ` Oscar Salvador
@ 2021-05-07  4:17                     ` HORIGUCHI NAOYA(堀口　直也)
  0 siblings, 0 replies; 20+ messages in thread
From: HORIGUCHI NAOYA(堀口　直也) @ 2021-05-07  4:17 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Naoya Horiguchi, Mike Kravetz, Michal Hocko, Muchun Song, akpm,
	linux-mm, linux-kernel

On Thu, May 06, 2021 at 10:51:33AM +0200, Oscar Salvador wrote:
> On Thu, May 06, 2021 at 10:31:22AM +0900, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Date: Thu, 6 May 2021 09:54:39 +0900
> > Subject: [PATCH] mm,hwpoison: fix race with compound page allocation
> > 
> > When hugetlb page fault (under overcommiting situation) and memory_failure()
> > race, VM_BUG_ON_PAGE() is triggered by the following race:
> > 
> >     CPU0:                           CPU1:
> > 
> >                                     gather_surplus_pages()
> >                                       page = alloc_surplus_huge_page()
> >     memory_failure_hugetlb()
> >       get_hwpoison_page(page)
> >         __get_hwpoison_page(page)
> >           get_page_unless_zero(page)
> >                                       zero = put_page_testzero(page)
> >                                       VM_BUG_ON_PAGE(!zero, page)
> >                                       enqueue_huge_page(h, page)
> >       put_page(page)
> > 
> > __get_hwpoison_page() only checks the page refcount before taking an
> > additional one for memory error handling, which is wrong because there are
> > time windows during initialization in which compound pages have a non-zero
> > refcount.
> > 
> > So make __get_hwpoison_page() check more page state for a few types
> > of compound pages. A PageSlab() check is added because otherwise the
> > "non anonymous thp" path is wrongly chosen.
> > 
> > Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Reported-by: Muchun Song <songmuchun@bytedance.com>
> > Cc: stable@vger.kernel.org # 5.12+
> 
> Hi Naoya, 
> 
> thanks for the patch.
> I have some concerns though, more below:
> 
> > ---
> >  mm/memory-failure.c | 53 +++++++++++++++++++++++++++------------------
> >  1 file changed, 32 insertions(+), 21 deletions(-)
> > 
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index a3659619d293..966a1d6b0bc8 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1095,30 +1095,41 @@ static int __get_hwpoison_page(struct page *page)
> >  {
> >  	struct page *head = compound_head(page);
> >  
> > -	if (!PageHuge(head) && PageTransHuge(head)) {
> > -		/*
> > -		 * Non anonymous thp exists only in allocation/free time. We
> > -		 * can't handle such a case correctly, so let's give it up.
> > -		 * This should be better than triggering BUG_ON when kernel
> > -		 * tries to touch the "partially handled" page.
> > -		 */
> > -		if (!PageAnon(head)) {
> > -			pr_err("Memory failure: %#lx: non anonymous thp\n",
> > -				page_to_pfn(page));
> > -			return 0;
> > +	if (PageCompound(page)) {
> > +		if (PageSlab(page)) {
> > +			return get_page_unless_zero(page);
> > +		} else if (PageHuge(head)) {
> > +			int ret = 0;
> > +
> > +			spin_lock(&hugetlb_lock);
> > +			if (HPageFreed(head) || HPageMigratable(head))
> > +				ret = get_page_unless_zero(head);
> > +			spin_unlock(&hugetlb_lock);
> > +			return ret;
> 
> Ok, I am probably overthinking this, but should we re-check under the
> lock whether the page is a hugetlb page?
> My concern is, what would happen if:
> 
> CPU0                                          CPU1
>  __get_hwpoison_page                          
>   PageHuge(head) == T                         
>                                               dissolve hugetlb page
>    hugetlb_lock                               
> 
> 
> In that case, by the time we get to check hugetlb flags, those checks
> might return false, and we do not get a refcount.

Thanks, we had better add a re-check, as dissolve_free_huge_page() does.

> So, I guess my question is: Should we re-check under the lock, and if it
> is not, do a "goto try_to_get_ref" that starts right at the beginning,
> or goes directly to the get_page_unless_zero at the end (the former
> probably better)?

Yes, a retry could work in this case.  Looking at the existing code,
get_any_page() provides a "retry" layer, but get_hwpoison_page() does not
currently call it when invoked from memory_failure().  So I am thinking of
adjusting the code so that get_hwpoison_page() calls get_any_page() instead
of calling __get_hwpoison_page() directly.
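
To make the "re-check under the lock, then retry" idea concrete, here is a
minimal userspace sketch.  It is not kernel code and not the eventual patch:
struct fake_page, fake_lock()/fake_unlock(), race_hook, try_get_ref() and the
retry limit of 3 are all hypothetical names and values made up for
illustration.  They only model PageHuge(), hugetlb_lock,
get_page_unless_zero() and a dissolve happening in the window Oscar
describes.  Leaving the dissolved page with a refcount of 1 is likewise an
arbitrary choice of the model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct page; NOT the kernel type. */
struct fake_page {
	atomic_int refcount;
	bool huge;			/* models PageHuge(head) */
	bool freed_or_migratable;	/* models HPageFreed() || HPageMigratable() */
};

/* Toy spinlock standing in for hugetlb_lock. */
static atomic_flag fake_hugetlb_lock = ATOMIC_FLAG_INIT;

static void fake_lock(void)
{
	while (atomic_flag_test_and_set(&fake_hugetlb_lock))
		;
}

static void fake_unlock(void)
{
	atomic_flag_clear(&fake_hugetlb_lock);
}

/* Models get_page_unless_zero(): take a reference only if refcount != 0. */
static bool fake_get_unless_zero(struct fake_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Test hook (assumption): runs between the unlocked check and the lock. */
static void (*race_hook)(struct fake_page *);

/* Models "CPU1" dissolving the hugetlb page while holding the lock. */
static void fake_dissolve(struct fake_page *p)
{
	fake_lock();
	p->huge = false;
	p->freed_or_migratable = false;
	fake_unlock();
	race_hook = NULL;	/* fire only once */
}

enum { NO_REF = 0, GOT_REF = 1, RETRY = -1 };

static int try_get_ref_once(struct fake_page *p)
{
	int ret = NO_REF;

	if (!p->huge)		/* unlocked check, may go stale below */
		return fake_get_unless_zero(p) ? GOT_REF : NO_REF;

	if (race_hook)
		race_hook(p);	/* the page may be dissolved right here */

	fake_lock();
	if (!p->huge)
		ret = RETRY;	/* re-check failed: start over from the top */
	else if (p->freed_or_migratable)
		ret = fake_get_unless_zero(p) ? GOT_REF : NO_REF;
	fake_unlock();
	return ret;
}

/* Bounded retry layer around the single attempt. */
static bool try_get_ref(struct fake_page *p)
{
	int ret, tries = 3;

	do {
		ret = try_get_ref_once(p);
	} while (ret == RETRY && --tries > 0);
	return ret == GOT_REF;
}

/* Replays the interleaving from the diagram and reports the outcome. */
static bool demo_dissolve_race(int *refcount_after)
{
	struct fake_page page = { .huge = true, .freed_or_migratable = true };
	bool got;

	atomic_init(&page.refcount, 1);
	race_hook = fake_dissolve;
	got = try_get_ref(&page);
	assert(!page.huge);	/* the dissolve really happened */
	*refcount_after = atomic_load(&page.refcount);
	return got;
}
```

In this model the first attempt sees the stale "huge" state, notices under
the lock that the page was dissolved, and returns RETRY; the second attempt
then takes the non-hugetlb path and gets its reference.  Without the re-check,
the flag test under the lock would simply fail and no refcount would be
taken, which is the outcome Oscar was worried about.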

Thanks,
Naoya Horiguchi

Thread overview: 20+ messages
2021-04-21  6:02 [PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages Muchun Song
2021-04-21  8:03 ` Michal Hocko
2021-04-21  8:15   ` [External] " Muchun Song
2021-04-21  8:21     ` Oscar Salvador
2021-04-21  8:41       ` Muchun Song
2021-04-21  8:49         ` Oscar Salvador
2021-04-21  8:58           ` Muchun Song
2021-04-21  8:43       ` Michal Hocko
2021-04-21  8:25     ` Michal Hocko
2021-04-21  8:33   ` HORIGUCHI NAOYA(堀口　直也)
2021-04-21  9:02     ` [External] " Muchun Song
2021-04-21 18:03     ` Mike Kravetz
2021-04-22  8:27       ` HORIGUCHI NAOYA(堀口　直也)
2021-04-23  8:01         ` HORIGUCHI NAOYA(堀口　直也)
2021-04-28  7:46           ` [PATCH] mm,hwpoison: fix race with compound page allocation Naoya Horiguchi
2021-04-28  8:23             ` Oscar Salvador
2021-04-28  9:18               ` HORIGUCHI NAOYA(堀口　直也)
2021-05-06  1:31                 ` [PATCH v2] " Naoya Horiguchi
2021-05-06  8:51                   ` Oscar Salvador
2021-05-07  4:17                     ` HORIGUCHI NAOYA(堀口　直也)
