From: David Hildenbrand <david@redhat.com>
To: Charan Teja Reddy <charante@codeaurora.org>,
	akpm@linux-foundation.org, mhocko@suse.com, vbabka@suse.cz,
	linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, vinmenon@codeaurora.org
Subject: Re: [PATCH] mm, page_alloc: fix core hung in free_pcppages_bulk()
Date: Tue, 11 Aug 2020 10:29:24 +0200
Message-ID: <3b07d2a6-8ce7-5957-8ca5-a8d977852e14@redhat.com>
In-Reply-To: <1597075833-16736-1-git-send-email-charante@codeaurora.org>

On 10.08.20 18:10, Charan Teja Reddy wrote:
> The following race is observed with repeated online and offline of memory
> blocks in the movable zone, with a delay between two successive onlines.
> 
> P1						P2
> 
> Online the first memory block in
> the movable zone. The pcp struct
> values are initialized to default
> values, i.e., pcp->high = 0 &
> pcp->batch = 1.
> 
> 					Allocate the pages from the
> 					movable zone.
> 
> Try to online the second memory
> block in the movable zone: it has
> entered online_pages() but has not
> yet called zone_pcp_update().
> 					This process enters the exit
> 					path and tries to release its
> 					order-0 pages to the pcp lists
> 					through free_unref_page_commit().
> 					As pcp->high = 0 and
> 					pcp->count = 1, it proceeds to
> 					call free_pcppages_bulk().
> Update the pcp values; the new
> values are now, say,
> pcp->high = 378, pcp->batch = 63.
> 					Read the pcp's batch value with
> 					READ_ONCE() and pass it to
> 					free_pcppages_bulk(); the values
> 					passed here are batch = 63,
> 					count = 1.
> 
> 					Since the number of pages on
> 					the pcp lists is less than
> 					->batch, it gets stuck in the
> 					while (list_empty(list)) loop
> 					with interrupts disabled, thus
> 					a core hang.
> 
> Avoid this by ensuring free_pcppages_bulk() is called with a count that
> does not exceed the number of pages actually on the pcp lists.
> 
> The mentioned race is fairly easy to reproduce without [1] because the
> pcps are not updated when the first memory block is onlined, so there is
> a wide race window for P2 between alloc+free and the pcp struct update
> triggered by onlining the second memory block.
> 
> With [1], the race still exists but is much narrower, since the pcp
> struct values are already updated when the first memory block is onlined.
> 
> [1]: https://patchwork.kernel.org/patch/11696389/
> 
> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
> ---
>  mm/page_alloc.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e4896e6..25e7e12 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3106,6 +3106,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>  	struct zone *zone = page_zone(page);
>  	struct per_cpu_pages *pcp;
>  	int migratetype;
> +	int high;
>  
>  	migratetype = get_pcppage_migratetype(page);
>  	__count_vm_event(PGFREE);
> @@ -3128,8 +3129,19 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  	list_add(&page->lru, &pcp->lists[migratetype]);
>  	pcp->count++;
> -	if (pcp->count >= pcp->high) {
> -		unsigned long batch = READ_ONCE(pcp->batch);
> +	high = READ_ONCE(pcp->high);
> +	if (pcp->count >= high) {
> +		int batch;
> +
> +		batch = READ_ONCE(pcp->batch);
> +		/*
> +		 * For non-default pcp struct values, high is always
> +		 * greater than the batch. If high < batch then pass
> +		 * proper count to free the pcp's list pages.
> +		 */
> +		if (unlikely(high < batch))
> +			batch = min(pcp->count, batch);
> +
>  		free_pcppages_bulk(zone, batch, pcp);
>  	}
>  }
> 
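
For illustration, below is a minimal userspace model of the round-robin
drain loop in free_pcppages_bulk() (a deliberately simplified sketch, not
the kernel code): when the requested count exceeds the number of pages
actually sitting on the pcp lists, the list-selection loop never finds a
non-empty list and spins forever -- with interrupts disabled in the real
function. The clamp mirrors the idea behind the fix above.

/*
 * Simplified userspace model of free_pcppages_bulk()'s round-robin
 * drain -- an illustration only, not the kernel implementation.
 * The pcp free lists are modelled as plain per-migratetype counters.
 */
#include <stdio.h>

#define MIGRATE_PCPTYPES 3

int main(void)
{
	int lists[MIGRATE_PCPTYPES] = { 1, 0, 0 };	/* one page on the lists */
	int pcp_count = 1;				/* models pcp->count */
	int count = 63;					/* "batch" the caller asked for */
	int migratetype = 0;

	/*
	 * The essence of the fix: never try to drain more pages than are
	 * actually present, otherwise the selection loop below never exits.
	 */
	if (count > pcp_count)
		count = pcp_count;

	while (count) {
		/* Round-robin search for a non-empty list. Without the
		 * clamp above, this spins forever once all lists are empty. */
		while (lists[migratetype] == 0)
			migratetype = (migratetype + 1) % MIGRATE_PCPTYPES;

		lists[migratetype]--;	/* "free" one page back to the buddy */
		pcp_count--;
		count--;
	}

	printf("drained; %d page(s) left on the pcp lists\n", pcp_count);
	return 0;
}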

I was wondering if we should rather set all pageblocks to
MIGRATE_ISOLATE in online_pages() before doing the online_pages_range()
call, and do undo_isolate_page_range() after onlining is done.

move_pfn_range_to_zone()->memmap_init_zone() marks all pageblocks
MIGRATE_MOVABLE; as that function is also used during boot, we could
supply a parameter to configure this.

This would prevent another race from happening: having pages exposed to
the buddy, ready for allocation, in online_pages_range() before the
sections are marked online.

It would avoid any pages getting allocated before we are completely done
onlining.

We would also need MIGRATE_ISOLATE/CONFIG_MEMORY_ISOLATION for
CONFIG_MEMORY_HOTPLUG.
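
A rough sketch of the suggested ordering (illustrative only: function
names follow memory_hotplug.c, but signatures are simplified, the
migratetype argument to move_pfn_range_to_zone() is hypothetical, and
error handling is omitted):

static void online_pages_sketch(unsigned long pfn, unsigned long nr_pages,
				struct zone *zone)
{
	/* 1. Initialize the memmap with every pageblock MIGRATE_ISOLATE
	 *    (hypothetical extra parameter instead of the current
	 *    hard-coded MIGRATE_MOVABLE). */
	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);

	/* 2. Expose the pages to the buddy; nothing can be allocated from
	 *    them yet because all pageblocks are isolated. */
	online_pages_range(pfn, nr_pages);

	/* 3. Mark the sections online and update zone/pcp state. */
	online_mem_sections(pfn, pfn + nr_pages);
	zone_pcp_update(zone);

	/* 4. Only now hand the pageblocks over for allocation. */
	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
}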

-- 
Thanks,

David / dhildenb


