* [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Pavel Tatashin @ 2020-09-01 12:46 UTC (permalink / raw)
  To: linux-kernel, akpm, mhocko, linux-mm, pasha.tatashin

There is a race during page offline that can lead to an infinite loop:
a page never ends up on a buddy list, so __offline_pages() keeps
retrying indefinitely or until a termination signal is received.

Thread#1 - a new process:

load_elf_binary
 begin_new_exec
  exec_mmap
   mmput
    exit_mmap
     tlb_finish_mmu
      tlb_flush_mmu
       release_pages
        free_unref_page_list
         free_unref_page_prepare
          set_pcppage_migratetype(page, migratetype);
              // Cache the page's migratetype (a value below MIGRATE_PCPTYPES) in page->index

Thread#2 - hot-removes memory
__offline_pages
  start_isolate_page_range
    set_migratetype_isolate
      set_pageblock_migratetype(page, MIGRATE_ISOLATE);
        // Set the pageblock's migration type to MIGRATE_ISOLATE
        drain_all_pages(zone);
             // drain per-cpu page lists to buddy allocator.

Thread#1 - continue
         free_unref_page_commit
           migratetype = get_pcppage_migratetype(page);
              // get old migration type
           list_add(&page->lru, &pcp->lists[migratetype]);
              // add new page to already drained pcp list

Thread#2
Never drains pcp again, and therefore gets stuck in the loop.
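
For reference, the reason the stale type matters is visible in the commit
half of the free path. A simplified sketch of free_unref_page_commit()
(abbreviated from mm/page_alloc.c of this kernel era; statistics and the
batch-trimming tail are elided):

static void free_unref_page_commit(struct page *page, unsigned long pfn)
{
	struct zone *zone = page_zone(page);
	struct per_cpu_pages *pcp;
	/* Reads the type cached by free_unref_page_prepare(), NOT the
	 * current pageblock type, which may be MIGRATE_ISOLATE by now. */
	int migratetype = get_pcppage_migratetype(page);

	if (migratetype >= MIGRATE_PCPTYPES) {
		if (unlikely(is_migrate_isolate(migratetype))) {
			/* Only taken when the *cached* type says isolate. */
			free_one_page(zone, page, pfn, 0, migratetype);
			return;
		}
		migratetype = MIGRATE_MOVABLE;
	}

	pcp = &this_cpu_ptr(zone->pageset)->pcp;
	list_add(&page->lru, &pcp->lists[migratetype]); /* page stranded here */
	pcp->count++;
	/* ... trim list to pcp->high / pcp->batch ... */
}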

The fix is to try to drain per-cpu lists again after
check_pages_isolated_cb() fails.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: stable@vger.kernel.org
---
 mm/memory_hotplug.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e9d5ab5d3ca0..d6d54922bfce 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		/* check again */
 		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
 					    NULL, check_pages_isolated_cb);
+		/*
+		 * per-cpu pages are drained in start_isolate_page_range, but if
+		 * there are still pages that are not free, make sure that we
+		 * drain again, because when we isolated range we might
+		 * have raced with another thread that was adding pages to
+		 * pcp list.
+		 */
+		if (ret)
+			drain_all_pages(zone);
 	} while (ret);
 
 	/* Ok, all of our target is isolated.
-- 
2.25.1




* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: David Rientjes @ 2020-09-01 18:37 UTC (permalink / raw)
  To: Pavel Tatashin; +Cc: linux-kernel, akpm, mhocko, linux-mm

On Tue, 1 Sep 2020, Pavel Tatashin wrote:

> There is a race during page offline that can lead to infinite loop:
> a page never ends up on a buddy list and __offline_pages() keeps
> retrying infinitely or until a termination signal is received.
> 
> Thread#1 - a new process:
> 
> load_elf_binary
>  begin_new_exec
>   exec_mmap
>    mmput
>     exit_mmap
>      tlb_finish_mmu
>       tlb_flush_mmu
>        release_pages
>         free_unref_page_list
>          free_unref_page_prepare
>           set_pcppage_migratetype(page, migratetype);
>              // Set page->index migration type below  MIGRATE_PCPTYPES
> 
> Thread#2 - hot-removes memory
> __offline_pages
>   start_isolate_page_range
>     set_migratetype_isolate
>       set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>         Set migration type to MIGRATE_ISOLATE-> set
>         drain_all_pages(zone);
>              // drain per-cpu page lists to buddy allocator.
> 
> Thread#1 - continue
>          free_unref_page_commit
>            migratetype = get_pcppage_migratetype(page);
>               // get old migration type
>            list_add(&page->lru, &pcp->lists[migratetype]);
>               // add new page to already drained pcp list
> 
> Thread#2
> Never drains pcp again, and therefore gets stuck in the loop.
> 
> The fix is to try to drain per-cpu lists again after
> check_pages_isolated_cb() fails.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: stable@vger.kernel.org

Acked-by: David Rientjes <rientjes@google.com>



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Michal Hocko @ 2020-09-02 14:01 UTC (permalink / raw)
  To: Pavel Tatashin; +Cc: linux-kernel, akpm, linux-mm, Vlastimil Babka, Mel Gorman

[Cc Mel and Vlastimil - I am still rummaging]

On Tue 01-09-20 08:46:15, Pavel Tatashin wrote:
> There is a race during page offline that can lead to infinite loop:
> a page never ends up on a buddy list and __offline_pages() keeps
> retrying infinitely or until a termination signal is received.
> 
> Thread#1 - a new process:
> 
> load_elf_binary
>  begin_new_exec
>   exec_mmap
>    mmput
>     exit_mmap
>      tlb_finish_mmu
>       tlb_flush_mmu
>        release_pages
>         free_unref_page_list
>          free_unref_page_prepare
>           set_pcppage_migratetype(page, migratetype);
>              // Set page->index migration type below  MIGRATE_PCPTYPES
> 
> Thread#2 - hot-removes memory
> __offline_pages
>   start_isolate_page_range
>     set_migratetype_isolate
>       set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>         Set migration type to MIGRATE_ISOLATE-> set
>         drain_all_pages(zone);
>              // drain per-cpu page lists to buddy allocator.
> 
> Thread#1 - continue
>          free_unref_page_commit
>            migratetype = get_pcppage_migratetype(page);
>               // get old migration type
>            list_add(&page->lru, &pcp->lists[migratetype]);
>               // add new page to already drained pcp list
> 
> Thread#2
> Never drains pcp again, and therefore gets stuck in the loop.
> 
> The fix is to try to drain per-cpu lists again after
> check_pages_isolated_cb() fails.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: stable@vger.kernel.org
> ---
>  mm/memory_hotplug.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9d5ab5d3ca0..d6d54922bfce 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  		/* check again */
>  		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
>  					    NULL, check_pages_isolated_cb);
> +		/*
> +		 * per-cpu pages are drained in start_isolate_page_range, but if
> +		 * there are still pages that are not free, make sure that we
> +		 * drain again, because when we isolated range we might
> +		 * have raced with another thread that was adding pages to
> +		 * pcp list.
> +		 */
> +		if (ret)
> +			drain_all_pages(zone);
>  	} while (ret);
>  
>  	/* Ok, all of our target is isolated.
> -- 
> 2.25.1
> 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Michal Hocko @ 2020-09-02 14:08 UTC (permalink / raw)
  To: Pavel Tatashin; +Cc: linux-kernel, akpm, linux-mm

On Tue 01-09-20 08:46:15, Pavel Tatashin wrote:
> There is a race during page offline that can lead to infinite loop:
> a page never ends up on a buddy list and __offline_pages() keeps
> retrying infinitely or until a termination signal is received.
> 
> Thread#1 - a new process:
> 
> load_elf_binary
>  begin_new_exec
>   exec_mmap
>    mmput
>     exit_mmap
>      tlb_finish_mmu
>       tlb_flush_mmu
>        release_pages
>         free_unref_page_list
>          free_unref_page_prepare
>           set_pcppage_migratetype(page, migratetype);
>              // Set page->index migration type below  MIGRATE_PCPTYPES
> 
> Thread#2 - hot-removes memory
> __offline_pages
>   start_isolate_page_range
>     set_migratetype_isolate
>       set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>         Set migration type to MIGRATE_ISOLATE-> set
>         drain_all_pages(zone);
>              // drain per-cpu page lists to buddy allocator.

It is not really clear to me how we could have passed
has_unmovable_pages at this stage when the page is not PageBuddy. Is
this because you are using Movable Zones?

> 
> Thread#1 - continue
>          free_unref_page_commit
>            migratetype = get_pcppage_migratetype(page);
>               // get old migration type
>            list_add(&page->lru, &pcp->lists[migratetype]);
>               // add new page to already drained pcp list
> 
> Thread#2
> Never drains pcp again, and therefore gets stuck in the loop.
> 
> The fix is to try to drain per-cpu lists again after
> check_pages_isolated_cb() fails.

But this means that the page is not isolated and so it could be reused
for something else. No?
 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: stable@vger.kernel.org
> ---
>  mm/memory_hotplug.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9d5ab5d3ca0..d6d54922bfce 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  		/* check again */
>  		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
>  					    NULL, check_pages_isolated_cb);
> +		/*
> +		 * per-cpu pages are drained in start_isolate_page_range, but if
> +		 * there are still pages that are not free, make sure that we
> +		 * drain again, because when we isolated range we might
> +		 * have raced with another thread that was adding pages to
> +		 * pcp list.
> +		 */
> +		if (ret)
> +			drain_all_pages(zone);
>  	} while (ret);
>  
>  	/* Ok, all of our target is isolated.
> -- 
> 2.25.1
> 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Michal Hocko @ 2020-09-02 14:10 UTC (permalink / raw)
  To: Pavel Tatashin; +Cc: linux-kernel, akpm, linux-mm, Vlastimil Babka, Mel Gorman

On Wed 02-09-20 16:01:17, Michal Hocko wrote:
> [Cc Mel and Vlastimil - I am still rummaging]
> 
> On Tue 01-09-20 08:46:15, Pavel Tatashin wrote:
> > There is a race during page offline that can lead to infinite loop:
> > a page never ends up on a buddy list and __offline_pages() keeps
> > retrying infinitely or until a termination signal is received.
> > 
> > Thread#1 - a new process:
> > 
> > load_elf_binary
> >  begin_new_exec
> >   exec_mmap
> >    mmput
> >     exit_mmap
> >      tlb_finish_mmu
> >       tlb_flush_mmu
> >        release_pages
> >         free_unref_page_list
> >          free_unref_page_prepare
> >           set_pcppage_migratetype(page, migratetype);
> >              // Set page->index migration type below  MIGRATE_PCPTYPES
> > 
> > Thread#2 - hot-removes memory
> > __offline_pages
> >   start_isolate_page_range
> >     set_migratetype_isolate
> >       set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> >         Set migration type to MIGRATE_ISOLATE-> set
> >         drain_all_pages(zone);
> >              // drain per-cpu page lists to buddy allocator.
> > 
> > Thread#1 - continue
> >          free_unref_page_commit
> >            migratetype = get_pcppage_migratetype(page);
> >               // get old migration type
> >            list_add(&page->lru, &pcp->lists[migratetype]);
> >               // add new page to already drained pcp list
> > 
> > Thread#2
> > Never drains pcp again, and therefore gets stuck in the loop.
> > 
> > The fix is to try to drain per-cpu lists again after
> > check_pages_isolated_cb() fails.

Still trying to wrap my head around this but I think this is not a
proper fix. It should be the page isolation that makes sure no races
are possible with the page freeing path.

> > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> > Cc: stable@vger.kernel.org
> > ---
> >  mm/memory_hotplug.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index e9d5ab5d3ca0..d6d54922bfce 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
> >  		/* check again */
> >  		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> >  					    NULL, check_pages_isolated_cb);
> > +		/*
> > +		 * per-cpu pages are drained in start_isolate_page_range, but if
> > +		 * there are still pages that are not free, make sure that we
> > +		 * drain again, because when we isolated range we might
> > +		 * have raced with another thread that was adding pages to
> > +		 * pcp list.
> > +		 */
> > +		if (ret)
> > +			drain_all_pages(zone);
> >  	} while (ret);
> >  
> >  	/* Ok, all of our target is isolated.
> > -- 
> > 2.25.1
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Pavel Tatashin @ 2020-09-02 14:26 UTC (permalink / raw)
  To: Michal Hocko; +Cc: LKML, Andrew Morton, linux-mm

On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 01-09-20 08:46:15, Pavel Tatashin wrote:
> > There is a race during page offline that can lead to infinite loop:
> > a page never ends up on a buddy list and __offline_pages() keeps
> > retrying infinitely or until a termination signal is received.
> >
> > Thread#1 - a new process:
> >
> > load_elf_binary
> >  begin_new_exec
> >   exec_mmap
> >    mmput
> >     exit_mmap
> >      tlb_finish_mmu
> >       tlb_flush_mmu
> >        release_pages
> >         free_unref_page_list
> >          free_unref_page_prepare
> >           set_pcppage_migratetype(page, migratetype);
> >              // Set page->index migration type below  MIGRATE_PCPTYPES
> >
> > Thread#2 - hot-removes memory
> > __offline_pages
> >   start_isolate_page_range
> >     set_migratetype_isolate
> >       set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> >         Set migration type to MIGRATE_ISOLATE-> set
> >         drain_all_pages(zone);
> >              // drain per-cpu page lists to buddy allocator.
>
> It is not really clear to me how we could have passed
> has_unmovable_pages at this stage when the page is not PageBuddy. Is
> this because you are using Movable Zones?

Yes, we hot-remove memory from the movable zone.

>
> >
> > Thread#1 - continue
> >          free_unref_page_commit
> >            migratetype = get_pcppage_migratetype(page);
> >               // get old migration type
> >            list_add(&page->lru, &pcp->lists[migratetype]);
> >               // add new page to already drained pcp list
> >
> > Thread#2
> > Never drains pcp again, and therefore gets stuck in the loop.
> >
> > The fix is to try to drain per-cpu lists again after
> > check_pages_isolated_cb() fails.
>
> But this means that the page is not isolated and so it could be reused
> for something else. No?

The page is in a movable zone, has zero references, and the section is
isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE) has
been called). The page should be offlinable, but it is lost on a pcp
list, as that list is never drained again after the first failure to
migrate all pages in the range.
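
(For reference, the cached type lives in page->index; these helpers are
from mm/page_alloc.c of this era:

static inline int get_pcppage_migratetype(struct page *page)
{
	return page->index;
}

static inline void set_pcppage_migratetype(struct page *page, int migratetype)
{
	page->index = migratetype;
}

so the pageblock can become MIGRATE_ISOLATE between the prepare and
commit steps without the per-page cached value ever being updated.)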

>
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> > Cc: stable@vger.kernel.org
> > ---
> >  mm/memory_hotplug.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index e9d5ab5d3ca0..d6d54922bfce 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
> >               /* check again */
> >               ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> >                                           NULL, check_pages_isolated_cb);
> > +             /*
> > +              * per-cpu pages are drained in start_isolate_page_range, but if
> > +              * there are still pages that are not free, make sure that we
> > +              * drain again, because when we isolated range we might
> > +              * have raced with another thread that was adding pages to
> > +              * pcp list.
> > +              */
> > +             if (ret)
> > +                     drain_all_pages(zone);
> >       } while (ret);
> >
> >       /* Ok, all of our target is isolated.
> > --
> > 2.25.1
> >
>
> --
> Michal Hocko
> SUSE Labs



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Pavel Tatashin @ 2020-09-02 14:31 UTC (permalink / raw)
  To: Michal Hocko; +Cc: LKML, Andrew Morton, linux-mm, Vlastimil Babka, Mel Gorman

> > > The fix is to try to drain per-cpu lists again after
> > > check_pages_isolated_cb() fails.
>
> Still trying to wrap my head around this but I think this is not a
> proper fix. It should be the page isolation to make sure no races are
> possible with the page freeing path.
>

As Bharata B Rao found in another thread, the problem was introduced
by this change:
c52e75935f8d: mm: remove extra drain pages on pcp list

So the drain used to be attempted on every iteration, together with
lru_add_drain_all(), which I think is excessive: we start a thread per
cpu to try to drain and catch a rare race condition. With the proposed
change we drain again only when we actually detect that condition.
Fixing it in start_isolate_page_range means that we must somehow
synchronize it with release_pages(), which adds cost to the runtime
code instead of to the hot-remove code.
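
For context on the cost, drain_all_pages() schedules and then waits for
a work item on every CPU that holds pcp pages; a sketch, abbreviated
from mm/page_alloc.c of this era (the bookkeeping that computes
cpus_with_pcps is elided):

void drain_all_pages(struct zone *zone)
{
	static cpumask_t cpus_with_pcps;
	int cpu;

	mutex_lock(&pcpu_drain_mutex);
	/* ... set cpus_with_pcps to the CPUs holding pcp pages ... */

	for_each_cpu(cpu, &cpus_with_pcps) {
		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);

		drain->zone = zone;
		INIT_WORK(&drain->work, drain_local_pages_wq);
		queue_work_on(cpu, mm_percpu_wq, &drain->work);
	}
	for_each_cpu(cpu, &cpus_with_pcps)
		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);

	mutex_unlock(&pcpu_drain_mutex);
}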

Pasha



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Vlastimil Babka @ 2020-09-02 14:49 UTC (permalink / raw)
  To: Pavel Tatashin, Michal Hocko; +Cc: LKML, Andrew Morton, linux-mm, Mel Gorman

On 9/2/20 4:31 PM, Pavel Tatashin wrote:
>> > > The fix is to try to drain per-cpu lists again after
>> > > check_pages_isolated_cb() fails.
>>
>> Still trying to wrap my head around this but I think this is not a
>> proper fix. It should be the page isolation to make sure no races are
>> possible with the page freeing path.
>>
> 
> As Bharata B Rao found in another thread, the problem was introduced
> by this change:
> c52e75935f8d: mm: remove extra drain pages on pcp list
> 
> So, the drain used to be tried every time with lru_add_drain_all();
> Which, I think is excessive, as we start a thread per cpu to try to
> drain and catch a rare race condition. With the proposed change we
> drain again only when we find such a condition. Fixing it in
> start_isolate_page_range means that we must somehow synchronize it
> with the release_pages() which adds costs to runtime code, instead of
> to hot-remove code.

Agreed. Isolation was always racy wrt freeing to pcplists, and it was
simply acceptable to do some extra drains if needed. Removing that race
would indeed be acceptable only if it didn't affect the alloc/free
fastpaths.

> Pasha
> 




* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Vlastimil Babka @ 2020-09-02 14:55 UTC (permalink / raw)
  To: Pavel Tatashin, Michal Hocko; +Cc: LKML, Andrew Morton, linux-mm

On 9/2/20 4:26 PM, Pavel Tatashin wrote:
> On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> >
>> > Thread#1 - continue
>> >          free_unref_page_commit
>> >            migratetype = get_pcppage_migratetype(page);
>> >               // get old migration type
>> >            list_add(&page->lru, &pcp->lists[migratetype]);
>> >               // add new page to already drained pcp list
>> >
>> > Thread#2
>> > Never drains pcp again, and therefore gets stuck in the loop.
>> >
>> > The fix is to try to drain per-cpu lists again after
>> > check_pages_isolated_cb() fails.
>>
>> But this means that the page is not isolated and so it could be reused
>> for something else. No?
> 
> The page is in a movable zone, has zero references, and the section is
> isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE);) is
> set. The page should be offlinable, but it is lost in a pcp list as
> that list is never drained again after the first failure to migrate
> all pages in the range.

Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
could allocate it from the pcplist before we do the extra drain. But then it
becomes "visible again" and the loop in __offline_pages() should catch it by
scan_movable_pages() - do_migrate_range(). And this time the pageblock is
already marked as isolated, so the page (freed by migration) won't end up on the
pcplist again.
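
(For orientation, the retry structure in question, as condensed pseudo-C
heavily paraphrased from __offline_pages() in mm/memory_hotplug.c of
this era; exact signatures and error handling are elided:

	do {
		/* Migrate away anything still allocated in the range; a
		 * page re-allocated from a pcplist is caught here. */
		while (scan_movable_pages(...) /* finds a used page */)
			do_migrate_range(pfn, end_pfn);

		/* Check again: is everything free and isolated in buddy? */
		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
					    NULL, check_pages_isolated_cb);
		/* With the proposed fix: drain pcplists before retrying. */
		if (ret)
			drain_all_pages(zone);
	} while (ret);
)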



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Michal Hocko @ 2020-09-02 15:13 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Pavel Tatashin, LKML, Andrew Morton, linux-mm

On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
> > On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >> >
> >> > Thread#1 - continue
> >> >          free_unref_page_commit
> >> >            migratetype = get_pcppage_migratetype(page);
> >> >               // get old migration type
> >> >            list_add(&page->lru, &pcp->lists[migratetype]);
> >> >               // add new page to already drained pcp list
> >> >
> >> > Thread#2
> >> > Never drains pcp again, and therefore gets stuck in the loop.
> >> >
> >> > The fix is to try to drain per-cpu lists again after
> >> > check_pages_isolated_cb() fails.
> >>
> >> But this means that the page is not isolated and so it could be reused
> >> for something else. No?
> > 
> > The page is in a movable zone, has zero references, and the section is
> > isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE);) is
> > set. The page should be offlinable, but it is lost in a pcp list as
> > that list is never drained again after the first failure to migrate
> > all pages in the range.
> 
> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
> could allocate it from the pcplist before we do the extra drain. But then it
> becomes "visible again" and the loop in __offline_pages() should catch it by
> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
> already marked as isolated, so the page (freed by migration) won't end up on the
> pcplist again.

So the page block is marked MIGRATE_ISOLATE but the allocation itself
could be used for non migrateable objects. Or does anything prevent that
from happening?

We really do depend on isolation to not allow reuse when offlining.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Pavel Tatashin @ 2020-09-02 15:40 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vlastimil Babka, LKML, Andrew Morton, linux-mm

> > >> But this means that the page is not isolated and so it could be reused
> > >> for something else. No?
> > >
> > > The page is in a movable zone, has zero references, and the section is
> > > isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE);) is
> > > set. The page should be offlinable, but it is lost in a pcp list as
> > > that list is never drained again after the first failure to migrate
> > > all pages in the range.
> >
> > Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
> > could allocate it from the pcplist before we do the extra drain. But then it
> > becomes "visible again" and the loop in __offline_pages() should catch it by
> > scan_movable_pages() - do_migrate_range(). And this time the pageblock is
> > already marked as isolated, so the page (freed by migration) won't end up on the
> > pcplist again.
>
> So the page block is marked MIGRATE_ISOLATE but the allocation itself
> could be used for non migrateable objects. Or does anything prevent that
> from happening?

Vlastimil is right: the page could be allocated from the pcplist if
someone requests an allocation; nothing that I can tell prevents that,
and we would immediately migrate such a page away in do_migrate_range().

> We really do depend on isolation to not allow reuse when offlining.

Once a page is isolated it is not re-used, but here the page is not
isolated because of the race between adding it to the pcp list and the
isolation. Draining a second time on failure fixes the race.

Pasha



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Vlastimil Babka @ 2020-09-02 17:51 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Pavel Tatashin, LKML, Andrew Morton, linux-mm

On 9/2/20 5:13 PM, Michal Hocko wrote:
> On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
>> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
>> > On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
>> >>
>> >> >
>> >> > Thread#1 - continue
>> >> >          free_unref_page_commit
>> >> >            migratetype = get_pcppage_migratetype(page);
>> >> >               // get old migration type
>> >> >            list_add(&page->lru, &pcp->lists[migratetype]);
>> >> >               // add new page to already drained pcp list
>> >> >
>> >> > Thread#2
>> >> > Never drains pcp again, and therefore gets stuck in the loop.
>> >> >
>> >> > The fix is to try to drain per-cpu lists again after
>> >> > check_pages_isolated_cb() fails.
>> >>
>> >> But this means that the page is not isolated and so it could be reused
>> >> for something else. No?
>> > 
>> > The page is in a movable zone, has zero references, and the section is
>> > isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE);) is
>> > set. The page should be offlinable, but it is lost in a pcp list as
>> > that list is never drained again after the first failure to migrate
>> > all pages in the range.
>> 
>> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
>> could allocate it from the pcplist before we do the extra drain. But then it
>> becomes "visible again" and the loop in __offline_pages() should catch it by
>> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
>> already marked as isolated, so the page (freed by migration) won't end up on the
>> pcplist again.
> 
> So the page block is marked MIGRATE_ISOLATE but the allocation itself
> could be used for non migrateable objects. Or does anything prevent that
> from happening?

In a movable zone, the allocation should not be used for non-migrateable
objects. E.g. if the zone were not ZONE_MOVABLE, the offlining could fail
regardless of this race (analogously for migrating away from CMA pageblocks).

> We really do depend on isolation to not allow reuse when offlining.

This is not really different from the page on the pcplist having been
allocated just a moment before the offlining, and thus the isolation,
started. We ultimately rely on being able to migrate any allocated
pages away during the isolation. This "freeing to pcplists" race
doesn't fundamentally change anything in this regard. We just have to
guarantee that pages on pcplists will eventually be flushed, to make
forward progress, and there was a bug in this aspect.



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Michal Hocko @ 2020-09-03  6:38 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Pavel Tatashin, LKML, Andrew Morton, linux-mm

On Wed 02-09-20 19:51:45, Vlastimil Babka wrote:
> On 9/2/20 5:13 PM, Michal Hocko wrote:
> > On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
> >> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
> >> > On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
> >> >>
> >> >> >
> >> >> > Thread#1 - continue
> >> >> >          free_unref_page_commit
> >> >> >            migratetype = get_pcppage_migratetype(page);
> >> >> >               // get old migration type
> >> >> >            list_add(&page->lru, &pcp->lists[migratetype]);
> >> >> >               // add new page to already drained pcp list
> >> >> >
> >> >> > Thread#2
> >> >> > Never drains pcp again, and therefore gets stuck in the loop.
> >> >> >
> >> >> > The fix is to try to drain per-cpu lists again after
> >> >> > check_pages_isolated_cb() fails.
> >> >>
> >> >> But this means that the page is not isolated and so it could be reused
> >> >> for something else. No?
> >> > 
> >> > The page is in a movable zone, has zero references, and the section is
> >> > isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE);) is
> >> > set. The page should be offlinable, but it is lost in a pcp list as
> >> > that list is never drained again after the first failure to migrate
> >> > all pages in the range.
> >> 
> >> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
> >> could allocate it from the pcplist before we do the extra drain. But then it
> >> becomes "visible again" and the loop in __offline_pages() should catch it by
> >> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
> >> already marked as isolated, so the page (freed by migration) won't end up on the
> >> pcplist again.
> > 
> > So the page block is marked MIGRATE_ISOLATE but the allocation itself
> > could be used for non migrateable objects. Or does anything prevent that
> > from happening?
> 
> In a movable zone, the allocation should not be used for non migrateable
> objects. E.g. if the zone was not ZONE_MOVABLE, the offlining could fail
> regardless of this race (analogically for migrating away from CMA pageblocks).
> 
> > We really do depend on isolation to not allow reuse when offlining.
> 
> This is not really different than if the page on pcplist was allocated just a
> moment before the offlining, thus isolation started. We ultimately rely on being
> able to migrate any allocated pages away during the isolation. This "freeing to
> pcplists" race doesn't fundamentally change anything in this regard. We just
> have to guarantee that pages on pcplists will be eventually flushed, to make
> forward progress, and there was a bug in this aspect.

You are right. I managed to confuse myself yesterday. The race is
impossible for !ZONE_MOVABLE because we do a PageBuddy check there. And
on the movable zone we are not losing the migrateability property.

Pavel, I think this will be useful information to add to the changelog.
We should also document this in the code to prevent further confusion.
I would suggest something like the following:

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 242c03121d73..56d4892bceb8 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * pageblocks we may have modified and return -EBUSY to caller. This
  * prevents two threads from simultaneously working on overlapping ranges.
  *
+ * Please note that there is no strong synchronization with the page allocator
+ * either. Pages might be freed while their page blocks are marked ISOLATED.
+ * In some cases pages might still end up on pcp lists and that would allow
+ * for their allocation even when they are in fact isolated already. Depending on
+ * how strong of a guarantee the caller needs drain_all_pages might be needed
+ * (e.g. __offline_pages will need to call it after check for isolated range for
+ * a next retry).
+ *
  * Return: the number of isolated pageblocks on success and -EBUSY if any part
  * of range cannot be isolated.
  */
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Michal Hocko @ 2020-09-03  7:07 UTC (permalink / raw)
  To: Pavel Tatashin; +Cc: linux-kernel, akpm, linux-mm

On Tue 01-09-20 08:46:15, Pavel Tatashin wrote:
[...]
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9d5ab5d3ca0..d6d54922bfce 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  		/* check again */
>  		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
>  					    NULL, check_pages_isolated_cb);
> +		/*
> +		 * per-cpu pages are drained in start_isolate_page_range, but if
> +		 * there are still pages that are not free, make sure that we
> +		 * drain again, because when we isolated range we might
> +		 * have raced with another thread that was adding pages to
> +		 * pcp list.

I would also add
		 * Forward progress should be still guaranteed because
		 * pages on the pcp list can only belong to MOVABLE_ZONE
		 * because has_unmovable_pages explicitly checks for
		 * PageBuddy on freed pages on other zones.
> +		 */
> +		if (ret)
> +			drain_all_pages(zone);
>  	} while (ret);
>  
>  	/* Ok, all of our target is isolated.
> -- 
> 2.25.1
> 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Pavel Tatashin @ 2020-09-03 13:43 UTC (permalink / raw)
  To: Michal Hocko; +Cc: LKML, Andrew Morton, linux-mm

Thanks Michal, I will add your comments.

Pasha

On Thu, Sep 3, 2020 at 3:07 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 01-09-20 08:46:15, Pavel Tatashin wrote:
> [...]
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index e9d5ab5d3ca0..d6d54922bfce 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
> >               /* check again */
> >               ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> >                                           NULL, check_pages_isolated_cb);
> > +             /*
> > +              * per-cpu pages are drained in start_isolate_page_range, but if
> > +              * there are still pages that are not free, make sure that we
> > +              * drain again, because when we isolated range we might
> > +              * have raced with another thread that was adding pages to
> > +              * pcp list.
>
> I would also add
>                  * Forward progress should be still guaranteed because
>                  * pages on the pcp list can only belong to MOVABLE_ZONE
>                  * because has_unmovable_pages explicitly checks for
>                  * PageBuddy on freed pages on other zones.
> > +              */
> > +             if (ret)
> > +                     drain_all_pages(zone);
> >       } while (ret);
> >
> >       /* Ok, all of our target is isolated.
> > --
> > 2.25.1
> >
>
> --
> Michal Hocko
> SUSE Labs



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Vlastimil Babka @ 2020-09-03 13:50 UTC (permalink / raw)
  To: Pavel Tatashin, linux-kernel, akpm, mhocko, linux-mm

On 9/1/20 2:46 PM, Pavel Tatashin wrote:
> There is a race during page offline that can lead to infinite loop:
> a page never ends up on a buddy list and __offline_pages() keeps
> retrying infinitely or until a termination signal is received.
> 
> Thread#1 - a new process:
> 
> load_elf_binary
>  begin_new_exec
>   exec_mmap
>    mmput
>     exit_mmap
>      tlb_finish_mmu
>       tlb_flush_mmu
>        release_pages
>         free_unref_page_list
>          free_unref_page_prepare
>           set_pcppage_migratetype(page, migratetype);
>              // Set page->index migration type below  MIGRATE_PCPTYPES
> 
> Thread#2 - hot-removes memory
> __offline_pages
>   start_isolate_page_range
>     set_migratetype_isolate
>       set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>         Set migration type to MIGRATE_ISOLATE-> set
>         drain_all_pages(zone);
>              // drain per-cpu page lists to buddy allocator.
> 
> Thread#1 - continue
>          free_unref_page_commit
>            migratetype = get_pcppage_migratetype(page);
>               // get old migration type
>            list_add(&page->lru, &pcp->lists[migratetype]);
>               // add new page to already drained pcp list
> 
> Thread#2
> Never drains pcp again, and therefore gets stuck in the loop.
> 
> The fix is to try to drain per-cpu lists again after
> check_pages_isolated_cb() fails.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: stable@vger.kernel.org

Fixes: ?
Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks.



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: David Hildenbrand @ 2020-09-03 18:20 UTC (permalink / raw)
  To: Michal Hocko, Vlastimil Babka
  Cc: Pavel Tatashin, LKML, Andrew Morton, linux-mm

On 03.09.20 08:38, Michal Hocko wrote:
> On Wed 02-09-20 19:51:45, Vlastimil Babka wrote:
>> On 9/2/20 5:13 PM, Michal Hocko wrote:
>>> On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
>>>> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
>>>>> On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Thread#1 - continue
>>>>>>>          free_unref_page_commit
>>>>>>>            migratetype = get_pcppage_migratetype(page);
>>>>>>>               // get old migration type
>>>>>>>            list_add(&page->lru, &pcp->lists[migratetype]);
>>>>>>>               // add new page to already drained pcp list
>>>>>>>
>>>>>>> Thread#2
>>>>>>> Never drains pcp again, and therefore gets stuck in the loop.
>>>>>>>
>>>>>>> The fix is to try to drain per-cpu lists again after
>>>>>>> check_pages_isolated_cb() fails.
>>>>>>
>>>>>> But this means that the page is not isolated and so it could be reused
>>>>>> for something else. No?
>>>>>
>>>>> The page is in a movable zone, has zero references, and the section is
>>>>> isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE);) is
>>>>> set. The page should be offlinable, but it is lost in a pcp list as
>>>>> that list is never drained again after the first failure to migrate
>>>>> all pages in the range.
>>>>
>>>> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
>>>> could allocate it from the pcplist before we do the extra drain. But then it
>>>> becomes "visible again" and the loop in __offline_pages() should catch it by
>>>> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
>>>> already marked as isolated, so the page (freed by migration) won't end up on the
>>>> pcplist again.
>>>
>>> So the page block is marked MIGRATE_ISOLATE but the allocation itself
>>> could be used for non migrateable objects. Or does anything prevent that
>>> from happening?
>>
>> In a movable zone, the allocation should not be used for non migrateable
>> objects. E.g. if the zone was not ZONE_MOVABLE, the offlining could fail
>> regardless of this race (analogically for migrating away from CMA pageblocks).
>>
>>> We really do depend on isolation to not allow reuse when offlining.
>>
>> This is not really different than if the page on pcplist was allocated just a
>> moment before the offlining, thus isolation started. We ultimately rely on being
>> able to migrate any allocated pages away during the isolation. This "freeing to
>> pcplists" race doesn't fundamentally change anything in this regard. We just
>> have to guarantee that pages on pcplists will be eventually flushed, to make
>> forward progress, and there was a bug in this aspect.
> 
> You are right. I managed to confuse myself yesterday. The race is
> impossible for !ZONE_MOVABLE because we do PageBuddy check there. And on
> the movable zone we are not losing the migrateability property.
> 
> Pavel I think this will be a useful information to add to the changelog.
> We should also document this in the code to prevent from further
> confusion. I would suggest something like the following:
> 
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 242c03121d73..56d4892bceb8 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   * pageblocks we may have modified and return -EBUSY to caller. This
>   * prevents two threads from simultaneously working on overlapping ranges.
>   *
> + * Please note that there is no strong synchronization with the page allocator
> + * either. Pages might be freed while their page blocks are marked ISOLATED.
> + * In some cases pages might still end up on pcp lists and that would allow
> + * for their allocation even when they are in fact isolated already. Depending on
> + * how strong of a guarantee the caller needs drain_all_pages might be needed
> + * (e.g. __offline_pages will need to call it after check for isolated range for
> + * a next retry).
> + *

As expressed in reply to v2, I dislike this hack. There is strong
synchronization, just PCP is special. Allocating from MIGRATE_ISOLATE is
just plain ugly.

Can't we temporarily disable PCP (while some pageblock in the zone is
isolated, which we know, e.g., due to the counter), so no new pages get
put onto PCP lists after draining, and re-enable it after no pageblocks
are isolated anymore? We keep draining the PCP, so it doesn't seem to be
of much use during that period, no? It's a performance hit already.

Then, we would only need exactly one drain. And we would only have to
check on the free path whether PCP is temporarily disabled.
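
A hypothetical sketch of that idea (the helper names below are invented
for illustration; no such API existed at the time):

/* Hypothetical: force every free straight to the buddy lists. */
void zone_pcp_disable(struct zone *zone)
{
	/* With high = 0 and batch = 1, freed pages leave the pcp
	 * lists immediately (invented helper name). */
	zone_set_pageset_high_and_batch(zone, 0, 1);
	drain_all_pages(zone);	/* one final drain; lists stay empty */
}

/* Hypothetical: restore the normal watermarks afterwards. */
void zone_pcp_enable(struct zone *zone)
{
	zone_set_pageset_high_and_batch(zone, calc_high(zone),
					calc_batch(zone));
}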

-- 
Thanks,

David / dhildenb




* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: Pavel Tatashin @ 2020-09-03 18:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Vlastimil Babka, LKML, Andrew Morton, linux-mm

On Thu, Sep 3, 2020 at 2:20 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 03.09.20 08:38, Michal Hocko wrote:
> > On Wed 02-09-20 19:51:45, Vlastimil Babka wrote:
> >> On 9/2/20 5:13 PM, Michal Hocko wrote:
> >>> On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
> >>>> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
> >>>>> On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> Thread#1 - continue
> >>>>>>>          free_unref_page_commit
> >>>>>>>            migratetype = get_pcppage_migratetype(page);
> >>>>>>>               // get old migration type
> >>>>>>>            list_add(&page->lru, &pcp->lists[migratetype]);
> >>>>>>>               // add new page to already drained pcp list
> >>>>>>>
> >>>>>>> Thread#2
> >>>>>>> Never drains pcp again, and therefore gets stuck in the loop.
> >>>>>>>
> >>>>>>> The fix is to try to drain per-cpu lists again after
> >>>>>>> check_pages_isolated_cb() fails.
> >>>>>>
> >>>>>> But this means that the page is not isolated and so it could be reused
> >>>>>> for something else. No?
> >>>>>
> >>>>> The page is in a movable zone, has zero references, and the section is
> >>>>> isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE);) is
> >>>>> set. The page should be offlinable, but it is lost in a pcp list as
> >>>>> that list is never drained again after the first failure to migrate
> >>>>> all pages in the range.
> >>>>
> >>>> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
> >>>> could allocate it from the pcplist before we do the extra drain. But then it
> >>>> becomes "visible again" and the loop in __offline_pages() should catch it by
> >>>> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
> >>>> already marked as isolated, so the page (freed by migration) won't end up on the
> >>>> pcplist again.
> >>>
> >>> So the page block is marked MIGRATE_ISOLATE but the allocation itself
> >>> could be used for non migrateable objects. Or does anything prevent that
> >>> from happening?
> >>
> >> In a movable zone, the allocation should not be used for non migrateable
> >> objects. E.g. if the zone was not ZONE_MOVABLE, the offlining could fail
> >> regardless of this race (analogically for migrating away from CMA pageblocks).
> >>
> >>> We really do depend on isolation to not allow reuse when offlining.
> >>
> >> This is not really different than if the page on pcplist was allocated just a
> >> moment before the offlining, thus isolation started. We ultimately rely on being
> >> able to migrate any allocated pages away during the isolation. This "freeing to
> >> pcplists" race doesn't fundamentally change anything in this regard. We just
> >> have to guarantee that pages on pcplists will be eventually flushed, to make
> >> forward progress, and there was a bug in this aspect.
> >
> > You are right. I managed to confuse myself yesterday. The race is
> > impossible for !ZONE_MOVABLE because we do PageBuddy check there. And on
> > the movable zone we are not losing the migrateability property.
> >
> > Pavel I think this will be a useful information to add to the changelog.
> > We should also document this in the code to prevent from further
> > confusion. I would suggest something like the following:
> >
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index 242c03121d73..56d4892bceb8 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> >   * pageblocks we may have modified and return -EBUSY to caller. This
> >   * prevents two threads from simultaneously working on overlapping ranges.
> >   *
> > + * Please note that there is no strong synchronization with the page allocator
> > + * either. Pages might be freed while their page blocks are marked ISOLATED.
> > + * In some cases pages might still end up on pcp lists and that would allow
> > + * for their allocation even when they are in fact isolated already. Depending on
> > + * how strong of a guarantee the caller needs drain_all_pages might be needed
> > + * (e.g. __offline_pages will need to call it after check for isolated range for
> > + * a next retry).
> > + *
>
> As expressed in reply to v2, I dislike this hack. There is strong
> synchronization, just PCP is special. Allocating from MIGRATE_ISOLATE is
> just plain ugly.
>
> Can't we temporarily disable PCP (while some pageblock in the zone is
> isolated, which we know e.g., due to the counter), so no new pages get
> put into PCP lists after draining, and re-enable after no pageblocks are
> isolated again? We keep draining the PCP, so it doesn't seem to be of a
> lot of use during that period, no? It's a performance hit already.
>
> Then, we would only need exactly one drain. And we would only have to
> check on the free path whether PCP is temporarily disabled.

Hm, we could use a static branch to disable it; that would keep the
release code just as fast, but I am worried it will make the code even
uglier. Let's see what others in this thread think about this idea.
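
A minimal sketch of the static-branch variant (the key name and the hook
placement are invented for illustration; the static_branch_*() API
itself is real, from <linux/jump_label.h>):

static DEFINE_STATIC_KEY_FALSE(pcp_bypass);	/* hypothetical key */

/* Offlining path, around the isolation: */
	static_branch_enable(&pcp_bypass);
	/* ... isolate, migrate, offline ... */
	static_branch_disable(&pcp_bypass);

/* Hypothetical hook in free_unref_page_commit(): */
	if (static_branch_unlikely(&pcp_bypass)) {
		/* bypass the pcp list, go straight to buddy */
		free_one_page(zone, page, pfn, 0, migratetype);
		return;
	}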

Thank you,
Pasha



* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
From: David Hildenbrand @ 2020-09-03 18:31 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Michal Hocko, Vlastimil Babka, LKML, Andrew Morton, linux-mm

On 03.09.20 20:23, Pavel Tatashin wrote:
> On Thu, Sep 3, 2020 at 2:20 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 03.09.20 08:38, Michal Hocko wrote:
>>> On Wed 02-09-20 19:51:45, Vlastimil Babka wrote:
>>>> On 9/2/20 5:13 PM, Michal Hocko wrote:
>>>>> On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
>>>>>> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
>>>>>>> On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thread#1 - continue
>>>>>>>>>          free_unref_page_commit
>>>>>>>>>            migratetype = get_pcppage_migratetype(page);
>>>>>>>>>               // get old migration type
>>>>>>>>>            list_add(&page->lru, &pcp->lists[migratetype]);
>>>>>>>>>               // add new page to already drained pcp list
>>>>>>>>>
>>>>>>>>> Thread#2
>>>>>>>>> Never drains pcp again, and therefore gets stuck in the loop.
>>>>>>>>>
>>>>>>>>> The fix is to try to drain per-cpu lists again after
>>>>>>>>> check_pages_isolated_cb() fails.
>>>>>>>>
>>>>>>>> But this means that the page is not isolated and so it could be reused
>>>>>>>> for something else. No?
>>>>>>>
>>>>>>> The page is in a movable zone, has zero references, and the section is
>>>>>>> isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE);) is
>>>>>>> set. The page should be offlinable, but it is lost in a pcp list as
>>>>>>> that list is never drained again after the first failure to migrate
>>>>>>> all pages in the range.
>>>>>>
>>>>>> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
>>>>>> could allocate it from the pcplist before we do the extra drain. But then it
>>>>>> becomes "visible again" and the loop in __offline_pages() should catch it by
>>>>>> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
>>>>>> already marked as isolated, so the page (freed by migration) won't end up on the
>>>>>> pcplist again.
>>>>>
>>>>> So the page block is marked MIGRATE_ISOLATE but the allocation itself
>>>>> could be used for non migrateable objects. Or does anything prevent that
>>>>> from happening?
>>>>
>>>> In a movable zone, the allocation should not be used for non migrateable
>>>> objects. E.g. if the zone was not ZONE_MOVABLE, the offlining could fail
>>>> regardless of this race (analogically for migrating away from CMA pageblocks).
>>>>
>>>>> We really do depend on isolation to not allow reuse when offlining.
>>>>
>>>> This is not really different than if the page on pcplist was allocated just a
>>>> moment before the offlining, thus isolation started. We ultimately rely on being
>>>> able to migrate any allocated pages away during the isolation. This "freeing to
>>>> pcplists" race doesn't fundamentally change anything in this regard. We just
>>>> have to guarantee that pages on pcplists will be eventually flushed, to make
>>>> forward progress, and there was a bug in this aspect.
>>>
>>> You are right. I managed to confuse myself yesterday. The race is
>>> impossible for !ZONE_MOVABLE because we do PageBuddy check there. And on
>>> the movable zone we are not losing the migrateability property.
>>>
>>> Pavel I think this will be a useful information to add to the changelog.
>>> We should also document this in the code to prevent from further
>>> confusion. I would suggest something like the following:
>>>
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index 242c03121d73..56d4892bceb8 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>>   * pageblocks we may have modified and return -EBUSY to caller. This
>>>   * prevents two threads from simultaneously working on overlapping ranges.
>>>   *
>>> + * Please note that there is no strong synchronization with the page allocator
>>> + * either. Pages might be freed while their page blocks are marked ISOLATED.
>>> + * In some cases pages might still end up on pcp lists and that would allow
>>> + * for their allocation even when they are in fact isolated already. Depending on
>>> + * how strong of a guarantee the caller needs drain_all_pages might be needed
>>> + * (e.g. __offline_pages will need to call it after check for isolated range for
>>> + * a next retry).
>>> + *
>>
>> As expressed in reply to v2, I dislike this hack. There is strong
>> synchronization, just PCP is special. Allocating from MIGRATE_ISOLATE is
>> just plain ugly.
>>
>> Can't we temporarily disable PCP (while some pageblock in the zone is
>> isolated, which we know e.g., due to the counter), so no new pages get
>> put into PCP lists after draining, and re-enable after no pageblocks are
>> isolated again? We keep draining the PCP, so it doesn't seem to be of a
>> lot of use during that period, no? It's a performance hit already.
>>
>> Then, we would only need exactly one drain. And we would only have to
>> check on the free path whether PCP is temporarily disabled.
> 
> Hm, we could use static branches to disable it; that would keep the
> release code just as fast, but I am worried it will make the code even
> uglier. Let's see what others in this thread think about this idea.

It would at least stop this "allocate from MIGRATE_ISOLATE" behavior
and the "drain when you feel like it, and drain more frequently to work
around broken PCP code" handling.
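
Just to illustrate the free path check I mean -- a completely untested
sketch, the static key name is made up:

static DEFINE_STATIC_KEY_FALSE(pcplists_disabled);	/* made-up name */

static void free_unref_page_commit(struct page *page, unsigned long pfn)
{
	struct zone *zone = page_zone(page);
	int migratetype = get_pcppage_migratetype(page);

	__count_vm_event(PGFREE);

	/* Bypass the pcplist entirely while isolation is in progress. */
	if (static_branch_unlikely(&pcplists_disabled)) {
		free_one_page(zone, page, pfn, 0, migratetype);
		return;
	}
	...
}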

Anyhow, I'll be offline until next Tuesday, cheers!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
  2020-09-03 18:23               ` Pavel Tatashin
  2020-09-03 18:31                 ` David Hildenbrand
@ 2020-09-04  6:32                 ` Vlastimil Babka
  1 sibling, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2020-09-04  6:32 UTC (permalink / raw)
  To: Pavel Tatashin, David Hildenbrand
  Cc: Michal Hocko, LKML, Andrew Morton, linux-mm

On 9/3/20 8:23 PM, Pavel Tatashin wrote:
>>
>> As expressed in reply to v2, I dislike this hack. There is strong
>> synchronization, just PCP is special. Allocating from MIGRATE_ISOLATE is
>> just plain ugly.
>>
>> Can't we temporarily disable PCP (while some pageblock in the zone is
>> isolated, which we can detect, e.g., via the counter), so no new pages
>> get put onto PCP lists after draining, and re-enable it once no
>> pageblocks are isolated anymore? We keep draining the PCP, so it doesn't
>> seem to be of much use during that period, no? It's a performance hit
>> already.
>>
>> Then, we would only need exactly one drain. And we would only have to
>> check on the free path whether PCP is temporarily disabled.
> 
> Hm, we could use static branches to disable it; that would keep the
> release code just as fast, but I am worried it will make the code even
> uglier. Let's see what others in this thread think about this idea.

Maybe we could just set pcp->high = 0 or something, and make sure the
pcplist user only reads this value while irqs are disabled. Then the
IPI in drain_all_pages() should guarantee there's nobody racing to free
to the pcplist. But we'd have to be careful not to introduce bugs like
the one fixed with [1]. And I'm not sure if this guarantee survives when
RT comes and replaces the disabled irqs with local_lock or something.

[1]
https://lore.kernel.org/linux-mm/1597150703-19003-1-git-send-email-charante@codeaurora.org/
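
Something like this, perhaps -- completely untested, and
zone_pcp_disable() is a name I just made up:

static void zone_pcp_disable(struct zone *zone)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct per_cpu_pages *pcp =
			&per_cpu_ptr(zone->pageset, cpu)->pcp;

		pcp->high = 0;
		pcp->batch = 1;
	}

	/*
	 * Make sure every cpu has observed the new limits and flushed
	 * whatever it had accumulated before we start relying on the
	 * pcplists staying empty.
	 */
	drain_all_pages(zone);
}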


> Thank you,
> Pasha
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
  2020-09-03 18:31                 ` David Hildenbrand
@ 2020-09-04  7:02                   ` Michal Hocko
  2020-09-04 14:25                     ` Pavel Tatashin
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2020-09-04  7:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pavel Tatashin, Vlastimil Babka, LKML, Andrew Morton, linux-mm

On Thu 03-09-20 20:31:04, David Hildenbrand wrote:
> On 03.09.20 20:23, Pavel Tatashin wrote:
> > On Thu, Sep 3, 2020 at 2:20 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 03.09.20 08:38, Michal Hocko wrote:
[...]
> >>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> >>> index 242c03121d73..56d4892bceb8 100644
> >>> --- a/mm/page_isolation.c
> >>> +++ b/mm/page_isolation.c
> >>> @@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> >>>   * pageblocks we may have modified and return -EBUSY to caller. This
> >>>   * prevents two threads from simultaneously working on overlapping ranges.
> >>>   *
> >>> + * Please note that there is no strong synchronization with the page allocator
> >>> + * either. Pages might be freed while their page blocks are marked ISOLATED.
> >>> + * In some cases pages might still end up on pcp lists and that would allow
> >>> + * for their allocation even when they are in fact isolated already. Depending on
> >>> + * how strong a guarantee the caller needs, drain_all_pages() might be needed
> >>> + * (e.g. __offline_pages() will need to call it after the check for an isolated
> >>> + * range, before the next retry).
> >>> + *
> >>
> >> As expressed in reply to v2, I dislike this hack. There is strong
> >> synchronization, just PCP is special. Allocating from MIGRATE_ISOLATE is
> >> just plain ugly.

Completely agreed! I am not happy about that either. But I believe this
hack is the easiest way forward for stable trees and as an immediate
fix. We can build on top of that of course.

> >> Can't we temporarily disable PCP (while some pageblock in the zone is
> >> isolated, which we can detect, e.g., via the counter), so no new pages
> >> get put onto PCP lists after draining, and re-enable it once no
> >> pageblocks are isolated anymore? We keep draining the PCP, so it doesn't
> >> seem to be of much use during that period, no? It's a performance hit
> >> already.

This is a good point.

> >> Then, we would only need exactly one drain. And we would only have to
> >> check on the free path whether PCP is temporarily disabled.
> > 
> > Hm, we could use static branches to disable it; that would keep the
> > release code just as fast, but I am worried it will make the code even
> > uglier. Let's see what others in this thread think about this idea.

I know that static branches are a very effective way to enable/disable
features, but I have no experience with how they perform for a very
short-lived use. Maybe that is just fine for a single place which needs
to be patched. It would especially be a problem if the static branch
is to be enabled from start_isolate_page_range, because that includes
all cma allocator users.

Another alternative would be to enable/disable the static branch only
from users who really care, but this is quite tricky because how do you
tell whether you need it or not? It seems that alloc_contig_range would
be just fine with a weaker semantic because it would "only" lead to a
spurious failure. Memory hotplug on the other hand really needs to have
a point where nobody interferes with the offlined memory, so it could
ask for a stronger semantic.

Yet another option would be to make draining stronger and actually
guarantee there are no in-flight pages to be freed to the pcp list.
One way would be to tweak pcp->high and implement a strong barrier
(IPI?) to sync with all CPUs. Quite expensive, especially when there are
many draining requests (read: cma users, because hotplug doesn't really
matter much as it happens seldom).

So no nice&cheap solution I can think of...
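
For completeness, the hotplug-only static branch variant I have in mind
would be something along these lines (completely untested, the key name
is made up):

static DEFINE_STATIC_KEY_FALSE(bypass_pcplists);

/* __offline_pages(), sketch only */
static_branch_inc(&bypass_pcplists);
ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
			       MEMORY_OFFLINE | REPORT_FAILURE);
drain_all_pages(zone);	/* a single drain would be enough */
...
static_branch_dec(&bypass_pcplists);	/* on success and failure paths */

with the free path testing static_branch_unlikely(&bypass_pcplists) and
falling back to free_one_page() directly.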
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
  2020-09-04  7:02                   ` Michal Hocko
@ 2020-09-04 14:25                     ` Pavel Tatashin
  2020-09-07  7:26                       ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Tatashin @ 2020-09-04 14:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Vlastimil Babka, LKML, Andrew Morton, linux-mm

> Another alternative would be to enable/disable the static branch only
> from users who really care, but this is quite tricky because how do you
> tell whether you need it or not? It seems that alloc_contig_range would
> be just fine with a weaker semantic because it would "only" lead to a
> spurious failure. Memory hotplug on the other hand really needs to have
> a point where nobody interferes with the offlined memory, so it could
> ask for a stronger semantic.
>
> Yet another option would be to make draining stronger and actually
> guarantee there are no in-flight pages to be freed to the pcp list.
> One way would be to tweak pcp->high and implement a strong barrier
> (IPI?) to sync with all CPUs. Quite expensive, especially when there are
> many draining requests (read: cma users, because hotplug doesn't really
> matter much as it happens seldom).
>
> So no nice&cheap solution I can think of...

I think start_isolate_page_range() should not be doing page draining
at all. It should isolate ranges, i.e. set the appropriate flags, and
draining should be performed by the users when appropriate: next to
the lru_add_drain_all() calls in both CMA and hotplug.
Currently, the way start_isolate_page_range() drains pages is very
inefficient. It calls drain_all_pages() for every valid page block,
which is a slow call as it starts a thread per cpu, and waits for
those threads to finish before returning.
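
I.e. roughly this pattern (paraphrased from memory, not the exact code):

/*
 * start_isolate_page_range(), paraphrased; each successful
 * set_migratetype_isolate() call ends with a drain_all_pages(zone).
 */
for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
	page = __first_valid_page(pfn, pageblock_nr_pages);
	if (page && set_migratetype_isolate(page, migratetype, flags)) {
		undo_isolate_page_range(start_pfn, pfn, migratetype);
		return -EBUSY;
	}
}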

We could optimize by moving the drain_all_pages() calls from
set_migratetype_isolate() to start_isolate_page_range() and call it
once for every different zone, but both current users of this
interface guarantee that all pfns [start_pfn, end_pfn] are within the
same zone, and I think we should keep it this way, so again the extra
traversal is going to be pure overhead.

This way we will have on average only a single drain per hot-remove
(instead of one per block), and the draining is going to be done
symmetrically, in one place. Faster hot-remove and cma alloc, and no
race; imo a win-win.
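
Concretely, on the hotplug side it would look roughly like this
(untested sketch; the CMA side would do the same next to its
lru_add_drain_all() call):

/* __offline_pages(), sketch */
ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
			       MEMORY_OFFLINE | REPORT_FAILURE);
if (ret)
	goto failed_removal;

/*
 * A single drain for the whole range instead of one per pageblock;
 * both callers guarantee the range is within one zone.
 */
lru_add_drain_all();
drain_all_pages(zone);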

Pasha


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline
  2020-09-04 14:25                     ` Pavel Tatashin
@ 2020-09-07  7:26                       ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2020-09-07  7:26 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: David Hildenbrand, Vlastimil Babka, LKML, Andrew Morton, linux-mm

On Fri 04-09-20 10:25:02, Pavel Tatashin wrote:
> > Another alternative would be to enable/disable the static branch only
> > from users who really care, but this is quite tricky because how do you
> > tell whether you need it or not? It seems that alloc_contig_range would
> > be just fine with a weaker semantic because it would "only" lead to a
> > spurious failure. Memory hotplug on the other hand really needs to have
> > a point where nobody interferes with the offlined memory, so it could
> > ask for a stronger semantic.
> >
> > Yet another option would be to make draining stronger and actually
> > guarantee there are no in-flight pages to be freed to the pcp list.
> > One way would be to tweak pcp->high and implement a strong barrier
> > (IPI?) to sync with all CPUs. Quite expensive, especially when there are
> > many draining requests (read: cma users, because hotplug doesn't really
> > matter much as it happens seldom).
> >
> > So no nice&cheap solution I can think of...
> 
> I think start_isolate_page_range() should not be doing page draining
> at all. It should isolate ranges, i.e. set the appropriate flags, and
> draining should be performed by the users when appropriate: next to
> the lru_add_drain_all() calls in both CMA and hotplug.

I disagree. The pcp draining is an implementation detail and we
shouldn't bother callers to be aware of it.

> Currently, the way start_isolate_page_range() drains pages is very
> inefficient. It calls drain_all_pages() for every valid page block,
> which is a slow call as it starts a thread per cpu, and waits for
> those threads to finish before returning.

This is an implementation detail.

> We could optimize by moving the drain_all_pages() calls from
> set_migratetype_isolate() to start_isolate_page_range() and call it
> once for every different zone, but both current users of this
> interface guarantee that all pfns [start_pfn, end_pfn] are within the
> same zone, and I think we should keep it this way, so again the extra
> traversal is going to be pure overhead.

Again, this just leads to tricky code. Just look at how easy it was to
break this by removing something that looked like a clear duplicate call.
It is true that memory isolation usage is limited to only a few use
cases, but I would strongly prefer to make the semantic clear so that we
do not repeat such regressions.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-09-07  7:26 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-01 12:46 [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline Pavel Tatashin
2020-09-01 18:37 ` David Rientjes
2020-09-02 14:01 ` Michal Hocko
2020-09-02 14:10   ` Michal Hocko
2020-09-02 14:31     ` Pavel Tatashin
2020-09-02 14:49       ` Vlastimil Babka
2020-09-02 14:08 ` Michal Hocko
2020-09-02 14:26   ` Pavel Tatashin
2020-09-02 14:55     ` Vlastimil Babka
2020-09-02 15:13       ` Michal Hocko
2020-09-02 15:40         ` Pavel Tatashin
2020-09-02 17:51         ` Vlastimil Babka
2020-09-03  6:38           ` Michal Hocko
2020-09-03 18:20             ` David Hildenbrand
2020-09-03 18:23               ` Pavel Tatashin
2020-09-03 18:31                 ` David Hildenbrand
2020-09-04  7:02                   ` Michal Hocko
2020-09-04 14:25                     ` Pavel Tatashin
2020-09-07  7:26                       ` Michal Hocko
2020-09-04  6:32                 ` Vlastimil Babka
2020-09-03  7:07 ` Michal Hocko
2020-09-03 13:43   ` Pavel Tatashin
2020-09-03 13:50 ` Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).