* [PATCH v2] mm/memory_hotplug: drain per-cpu pages again during memory offline
@ 2020-09-03 14:00 Pavel Tatashin
  2020-09-03 17:36 ` David Hildenbrand
  0 siblings, 1 reply; 6+ messages in thread
From: Pavel Tatashin @ 2020-09-03 14:00 UTC (permalink / raw)
  To: linux-kernel, akpm, mhocko, linux-mm, pasha.tatashin, osalvador,
	richard.weiyang, david, vbabka, rientjes

There is a race during page offline that can lead to an infinite loop:
a page never ends up on a buddy list and __offline_pages() keeps
retrying indefinitely, until a termination signal is received.

Thread#1 - a new process:

load_elf_binary
 begin_new_exec
  exec_mmap
   mmput
    exit_mmap
     tlb_finish_mmu
      tlb_flush_mmu
       release_pages
        free_unref_page_list
         free_unref_page_prepare
          set_pcppage_migratetype(page, migratetype);
             // Stash the migration type in page->index (below MIGRATE_PCPTYPES)

Thread#2 - hot-removes memory
__offline_pages
  start_isolate_page_range
    set_migratetype_isolate
      set_pageblock_migratetype(page, MIGRATE_ISOLATE);
        // Set pageblock migration type to MIGRATE_ISOLATE
        drain_all_pages(zone);
             // drain per-cpu page lists to buddy allocator.

Thread#1 - continue
         free_unref_page_commit
           migratetype = get_pcppage_migratetype(page);
              // get old migration type
           list_add(&page->lru, &pcp->lists[migratetype]);
              // add new page to already drained pcp list

Thread#2
Never drains the pcp lists again, and therefore gets stuck in the loop.

The fix is to try to drain per-cpu lists again after
check_pages_isolated_cb() fails.

Fixes: c52e75935f8d ("mm: remove extra drain pages on pcp list")
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: stable@vger.kernel.org
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/memory_hotplug.c | 14 ++++++++++++++
 mm/page_isolation.c |  8 ++++++++
 2 files changed, 22 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e9d5ab5d3ca0..b11a269e2356 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1575,6 +1575,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		/* check again */
 		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
 					    NULL, check_pages_isolated_cb);
+		/*
+		 * Per-cpu pages are drained in start_isolate_page_range, but if
+		 * there are still pages that are not free, make sure that we
+		 * drain again, because when we isolated the range we might
+		 * have raced with another thread that was adding pages to a pcp
+		 * list.
+		 *
+		 * Forward progress should still be guaranteed because
+		 * pages on the pcp lists can only belong to ZONE_MOVABLE,
+		 * because has_unmovable_pages explicitly checks for
+		 * PageBuddy on freed pages in other zones.
+		 */
+		if (ret)
+			drain_all_pages(zone);
 	} while (ret);
 
 	/* Ok, all of our target is isolated.
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 242c03121d73..63a3db10a8c0 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * pageblocks we may have modified and return -EBUSY to caller. This
  * prevents two threads from simultaneously working on overlapping ranges.
  *
+ * Please note that there is no strong synchronization with the page allocator
+ * either. Pages might be freed while their page blocks are marked ISOLATED.
+ * In some cases pages might still end up on pcp lists and that would allow
+ * for their allocation even though they are in fact already isolated. Depending
+ * on how strong a guarantee the caller needs, drain_all_pages() might be
+ * needed (e.g. __offline_pages calls it after checking for isolated ranges,
+ * before the next retry).
+ *
  * Return: the number of isolated pageblocks on success and -EBUSY if any part
  * of range cannot be isolated.
  */
-- 
2.25.1



* Re: [PATCH v2] mm/memory_hotplug: drain per-cpu pages again during memory offline
  2020-09-03 14:00 [PATCH v2] mm/memory_hotplug: drain per-cpu pages again during memory offline Pavel Tatashin
@ 2020-09-03 17:36 ` David Hildenbrand
  2020-09-03 18:07   ` Pavel Tatashin
  2020-09-03 19:31   ` Andrew Morton
  0 siblings, 2 replies; 6+ messages in thread
From: David Hildenbrand @ 2020-09-03 17:36 UTC (permalink / raw)
  To: Pavel Tatashin, linux-kernel, akpm, mhocko, linux-mm, osalvador,
	richard.weiyang, vbabka, rientjes

On 03.09.20 16:00, Pavel Tatashin wrote:
> There is a race during page offline that can lead to infinite loop:
> a page never ends up on a buddy list and __offline_pages() keeps
> retrying infinitely or until a termination signal is received.
> 
> [... rest of patch snipped ...]

(still on vacation, back next week on Tuesday)

I didn't look into the discussions in v1, but to me this looks like we are
trying to hide an actual bug by implementing hacks in the caller
(repeated calls to drain_all_pages()). What about alloc_contig_range()
users - they get more allocation errors just because the PCP code doesn't
play along.

There *is* strong synchronization with the page allocator - however,
there seems to be one corner-case race where we allow allocating pages
from isolated pageblocks.

I want that fixed instead if possible, otherwise this is just an ugly
hack to make the obvious symptoms (offlining looping forever) disappear.

If that is not possible easily, I'd much rather see all
drain_all_pages() calls moved to the caller and have the expected
behavior documented, instead of specifying "there is no strong
synchronization with the page allocator" - which is wrong in all but the
PCP case (and there only in one possible race?).

I do wonder why we hit this issue now and not before - I suspect
something in PCP code changed that made this race possible.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v2] mm/memory_hotplug: drain per-cpu pages again during memory offline
  2020-09-03 17:36 ` David Hildenbrand
@ 2020-09-03 18:07   ` Pavel Tatashin
  2020-09-03 19:31   ` Andrew Morton
  1 sibling, 0 replies; 6+ messages in thread
From: Pavel Tatashin @ 2020-09-03 18:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: LKML, Andrew Morton, Michal Hocko, linux-mm, Oscar Salvador,
	Wei Yang, Vlastimil Babka, rientjes

On Thu, Sep 3, 2020 at 1:36 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 03.09.20 16:00, Pavel Tatashin wrote:
> > There is a race during page offline that can lead to infinite loop:
> > a page never ends up on a buddy list and __offline_pages() keeps
> > retrying infinitely or until a termination signal is received.
> >
> > Thread#1 - a new process:
> >
> > load_elf_binary
> >  begin_new_exec
> >   exec_mmap
> >    mmput
> >     exit_mmap
> >      tlb_finish_mmu
> >       tlb_flush_mmu
> >        release_pages
> >         free_unref_page_list
> >          free_unref_page_prepare
> >           set_pcppage_migratetype(page, migratetype);
> >              // Set page->index migration type below  MIGRATE_PCPTYPES
> >
> > Thread#2 - hot-removes memory
> > __offline_pages
> >   start_isolate_page_range
> >     set_migratetype_isolate
> >       set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> >         Set migration type to MIGRATE_ISOLATE-> set
> >         drain_all_pages(zone);
> >              // drain per-cpu page lists to buddy allocator.
> >
> > Thread#1 - continue
> >          free_unref_page_commit
> >            migratetype = get_pcppage_migratetype(page);
> >               // get old migration type
> >            list_add(&page->lru, &pcp->lists[migratetype]);
> >               // add new page to already drained pcp list
> >
> > Thread#2
> > Never drains pcp again, and therefore gets stuck in the loop.
> >
> > The fix is to try to drain per-cpu lists again after
> > check_pages_isolated_cb() fails.
> >
> > Fixes: c52e75935f8d ("mm: remove extra drain pages on pcp list")
> >
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> > Cc: stable@vger.kernel.org
> > Acked-by: David Rientjes <rientjes@google.com>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  mm/memory_hotplug.c | 14 ++++++++++++++
> >  mm/page_isolation.c |  8 ++++++++
> >  2 files changed, 22 insertions(+)
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index e9d5ab5d3ca0..b11a269e2356 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1575,6 +1575,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
> >               /* check again */
> >               ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> >                                           NULL, check_pages_isolated_cb);
> > +             /*
> > +              * per-cpu pages are drained in start_isolate_page_range, but if
> > +              * there are still pages that are not free, make sure that we
> > +              * drain again, because when we isolated range we might
> > +              * have raced with another thread that was adding pages to pcp
> > +              * list.
> > +              *
> > +              * Forward progress should be still guaranteed because
> > +              * pages on the pcp list can only belong to MOVABLE_ZONE
> > +              * because has_unmovable_pages explicitly checks for
> > +              * PageBuddy on freed pages on other zones.
> > +              */
> > +             if (ret)
> > +                     drain_all_pages(zone);
> >       } while (ret);
> >
> >       /* Ok, all of our target is isolated.
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index 242c03121d73..63a3db10a8c0 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> >   * pageblocks we may have modified and return -EBUSY to caller. This
> >   * prevents two threads from simultaneously working on overlapping ranges.
> >   *
> > + * Please note that there is no strong synchronization with the page allocator
> > + * either. Pages might be freed while their page blocks are marked ISOLATED.
> > + * In some cases pages might still end up on pcp lists and that would allow
> > + * for their allocation even when they are in fact isolated already. Depending
> > + * on how strong of a guarantee the caller needs drain_all_pages might be needed
> > + * (e.g. __offline_pages will need to call it after check for isolated range for
> > + * a next retry).
> > + *
> >   * Return: the number of isolated pageblocks on success and -EBUSY if any part
> >   * of range cannot be isolated.
> >   */
> >
>
> (still on vacation, back next week on Tuesday)
>
> I didn't look into discussions in v1, but to me this looks like we are
> trying to hide an actual bug by implementing hacks in the caller

Hi David,

Please read discussion in v1 [1], some of the things that you brought
up were covered in that discussion.

> (repeated calls to drain_all_pages()). What about alloc_contig_range()
> users - you get more allocation errors just because PCP code doesn't
> play along.

Yes, I looked at that, and it sounds normal to me: an allocation error
means you try again if needed; at least we do not get stuck in an
infinite loop as we do here.

>
> There *is* strong synchronization with the page allocator - however,
> there seems to be one corner case race where we allow to allocate pages
> from isolated pageblocks.
>
> I want that fixed instead if possible, otherwise this is just an ugly
> hack to make the obvious symptoms (offlining looping forever) disappear.

The fix would be to synchronize the isolation code with the release
code, which would slow down the release-code fastpath instead of the
rare page-offline path.

>
> If that is not possible easily, I'd much rather want to see all
> drain_all_pages() calls being moved to the caller and have the expected
> behavior documented instead of specifying "there is no strong
> synchronization with the page allocator" - which is wrong in all but PCP
> cases (and there only in one possible race?).
>
> I do wonder why we hit this issue now and not before - I suspect
> something in PCP code changed that made this race possible.

I suspect because memory hot-remove was never stressed enough? We have
a special use case where we do hot-remove on every shutdown, and the
repro rate was only about one in ~400 reboots. This bug and other
deadlocks [2], [3] were tough to root cause, partially because of the
slow repro rate.

Pasha

[1] https://lore.kernel.org/lkml/20200901124615.137200-1-pasha.tatashin@soleen.com
[2] https://lore.kernel.org/lkml/CA+CK2bCQcnTpzq2wGFa3D50PtKwBoWbDBm56S9y8c+j+pD+KSw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+CK2bBEHFuLLg79_h6bv4Vey+B0B2YXyBxTBa=Le12OKbNdwA@mail.gmail.com/


* Re: [PATCH v2] mm/memory_hotplug: drain per-cpu pages again during memory offline
  2020-09-03 17:36 ` David Hildenbrand
  2020-09-03 18:07   ` Pavel Tatashin
@ 2020-09-03 19:31   ` Andrew Morton
  2020-09-03 19:35     ` David Hildenbrand
  2020-09-04 12:42     ` Michal Hocko
  1 sibling, 2 replies; 6+ messages in thread
From: Andrew Morton @ 2020-09-03 19:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pavel Tatashin, linux-kernel, mhocko, linux-mm, osalvador,
	richard.weiyang, vbabka, rientjes

On Thu, 3 Sep 2020 19:36:26 +0200 David Hildenbrand <david@redhat.com> wrote:

> (still on vacation, back next week on Tuesday)
> 
> I didn't look into discussions in v1, but to me this looks like we are
> trying to hide an actual bug by implementing hacks in the caller
> (repeated calls to drain_all_pages()). What about alloc_contig_range()
> users - you get more allocation errors just because PCP code doesn't
> play along.
> 
> There *is* strong synchronization with the page allocator - however,
> there seems to be one corner case race where we allow to allocate pages
> from isolated pageblocks.
> 
> I want that fixed instead if possible, otherwise this is just an ugly
> hack to make the obvious symptoms (offlining looping forever) disappear.
> 
> If that is not possible easily, I'd much rather want to see all
> drain_all_pages() calls being moved to the caller and have the expected
> behavior documented instead of specifying "there is no strong
> synchronization with the page allocator" - which is wrong in all but PCP
> cases (and there only in one possible race?).
> 

It's a two-line hack which fixes a bug in -stable kernels, so I'm
inclined to proceed with it anyway.  We can undo it later on as part of
a better fix, OK?

Unless you think there's some new misbehaviour which we might see as a
result of this approach?


* Re: [PATCH v2] mm/memory_hotplug: drain per-cpu pages again during memory offline
  2020-09-03 19:31   ` Andrew Morton
@ 2020-09-03 19:35     ` David Hildenbrand
  2020-09-04 12:42     ` Michal Hocko
  1 sibling, 0 replies; 6+ messages in thread
From: David Hildenbrand @ 2020-09-03 19:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Pavel Tatashin, linux-kernel, mhocko,
	linux-mm, osalvador, richard.weiyang, vbabka, rientjes



> Am 03.09.2020 um 21:31 schrieb Andrew Morton <akpm@linux-foundation.org>:
> 
> On Thu, 3 Sep 2020 19:36:26 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
>> [... David's earlier comments quoted; snipped ...]
> 
> It's a two-line hack which fixes a bug in -stable kernels, so I'm
> inclined to proceed with it anyway.  We can undo it later on as part of
> a better fix, OK?

Agreed as a stable fix, but I really want to see a proper fix (e.g., disabling the PCP lists while there are isolated pageblocks) on top.

> 
> Unless you think there's some new misbehaviour which we might see as a
> result of this approach?
> 

We basically disable the PCP lists by flushing them repeatedly. But performance shouldn't matter here.


* Re: [PATCH v2] mm/memory_hotplug: drain per-cpu pages again during memory offline
  2020-09-03 19:31   ` Andrew Morton
  2020-09-03 19:35     ` David Hildenbrand
@ 2020-09-04 12:42     ` Michal Hocko
  1 sibling, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2020-09-04 12:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Pavel Tatashin, linux-kernel, linux-mm,
	osalvador, richard.weiyang, vbabka, rientjes

On Thu 03-09-20 12:31:36, Andrew Morton wrote:
> On Thu, 3 Sep 2020 19:36:26 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
> > [... David's comments quoted; snipped ...]
> 
> It's a two-line hack which fixes a bug in -stable kernels, so I'm
> inclined to proceed with it anyway.  We can undo it later on as part of
> a better fix, OK?

Agreed. http://lkml.kernel.org/r/20200904070235.GA15277@dhcp22.suse.cz
for reference.
-- 
Michal Hocko
SUSE Labs

