linux-kernel.vger.kernel.org archive mirror
* [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
@ 2020-08-11 12:58 Charan Teja Reddy
  2020-08-11 21:05 ` David Hildenbrand
  2020-08-13 11:41 ` Michal Hocko
  0 siblings, 2 replies; 12+ messages in thread
From: Charan Teja Reddy @ 2020-08-11 12:58 UTC (permalink / raw)
  To: akpm, mhocko, vbabka, david, rientjes, linux-mm
  Cc: linux-kernel, vinmenon, Charan Teja Reddy

The following race is observed with repeated online and offline cycles,
with a delay between two successive onlines of memory blocks in the
movable zone.

P1						P2

Online the first memory block in
the movable zone. The pcp struct
values are initialized to default
values, i.e., pcp->high = 0 &
pcp->batch = 1.

					Allocate the pages from the
					movable zone.

Try to online the second memory
block in the movable zone; it
has entered online_pages() but
is yet to call zone_pcp_update().
					This process has entered the
					exit path, thus it tries to
					release the order-0 pages to
					the pcp lists through
					free_unref_page_commit().
					As pcp->high = 0 and
					pcp->count = 1, it proceeds
					to call free_pcppages_bulk().
Update the pcp values; the new
pcp values are, say,
pcp->high = 378, pcp->batch = 63.
					Read the pcp's batch value using
					READ_ONCE() and pass it to
					free_pcppages_bulk(); the values
					passed here are batch = 63,
					count = 1.

					Since the number of pages in
					the pcp lists is less than
					->batch, it gets stuck in the
					while (list_empty(list)) loop
					with interrupts disabled, thus
					hanging the core.

Avoid this by ensuring free_pcppages_bulk() is called with a proper
count of pcp list pages.
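
For illustration, the loop in question looks roughly like this (a
paraphrased sketch of free_pcppages_bulk() in mm/page_alloc.c, with the
bookkeeping details trimmed):

	while (count) {
		struct list_head *list;

		/* pick the next non-empty pcp list, round-robin */
		do {
			batch_free++;
			if (++migratetype == MIGRATE_PCPTYPES)
				migratetype = 0;
			list = &pcp->lists[migratetype];
		} while (list_empty(list));

		/* ... free up to batch_free pages from this list,
		 * decrementing count for each page freed ... */
	}

If the count passed in is larger than the number of pages actually on
the pcp lists, the inner do/while never finds a non-empty list and
spins forever with interrupts disabled.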

The mentioned race is somewhat easily reproducible without [1] because
the pcp's are not updated for the first memory block online, and thus
there is a wide enough race window for P2 between the alloc+free and the
pcp struct values update through the onlining of the second memory block.

With [1], the race still exists, but it is much narrower, since we
update the pcp struct values for the first memory block online itself.

[1]: https://patchwork.kernel.org/patch/11696389/

Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
---

v1: https://patchwork.kernel.org/patch/11707637/

 mm/page_alloc.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4896e6..839039f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	struct page *page, *tmp;
 	LIST_HEAD(head);
 
+	/*
+	 * Ensure a proper count is passed which otherwise would get stuck
+	 * in the below while (list_empty(list)) loop.
+	 */
+	count = min(pcp->count, count);
 	while (count) {
 		struct list_head *list;
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



* Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
  2020-08-11 12:58 [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk() Charan Teja Reddy
@ 2020-08-11 21:05 ` David Hildenbrand
  2020-08-12  9:46   ` Charan Teja Kalla
  2020-08-13 11:41 ` Michal Hocko
  1 sibling, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2020-08-11 21:05 UTC (permalink / raw)
  To: Charan Teja Reddy, akpm, mhocko, vbabka, rientjes, linux-mm
  Cc: linux-kernel, vinmenon

On 11.08.20 14:58, Charan Teja Reddy wrote:
> The following race is observed with repeated online and offline cycles,
> with a delay between two successive onlines of memory blocks in the
> movable zone.
[...]

IIUC, this is not limited to the movable zone, it could also happen in
corner cases with the normal zone (e.g., hotplug to a node that only has
DMA memory, or no other memory yet).

> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
[...]

Fixes: and Cc: stable... tags?

-- 
Thanks,

David / dhildenb



* Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
  2020-08-11 21:05 ` David Hildenbrand
@ 2020-08-12  9:46   ` Charan Teja Kalla
  2020-08-12 10:00     ` David Hildenbrand
  2020-08-12 18:53     ` David Rientjes
  0 siblings, 2 replies; 12+ messages in thread
From: Charan Teja Kalla @ 2020-08-12  9:46 UTC (permalink / raw)
  To: David Hildenbrand, akpm, mhocko, vbabka, rientjes, linux-mm
  Cc: linux-kernel, vinmenon


Thanks David for the inputs.

On 8/12/2020 2:35 AM, David Hildenbrand wrote:
> On 11.08.20 14:58, Charan Teja Reddy wrote:
[...]
> 
> IIUC, this is not limited to the movable zone, it could also happen in
> corner cases with the normal zone (e.g., hotplug to a node that only has
> DMA memory, or no other memory yet).

Yes, this is my understanding too. I explained the above race in terms
of just the movable zone, for which it was observed. We can add the below
line at the end of the patch commit message:
"This is not limited to the movable zone, it could also happen in cases
with the normal zone (e.g., hotplug to a node that only has DMA memory,
or no other memory yet)."

Just curious: do there exist systems where just a DMA zone is present
and we hot add the normal zone? I am not aware of such a thing in the
embedded world.
> 
[...]
> 
> Fixes: and Cc: stable... tags?

Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into
one-list-per-migrate-type")
Cc: <stable@vger.kernel.org> [2.6+]

I am not sure if I should raise a V3 including these?
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


* Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
  2020-08-12  9:46   ` Charan Teja Kalla
@ 2020-08-12 10:00     ` David Hildenbrand
  2020-08-12 10:11       ` Charan Teja Kalla
  2020-08-12 18:53     ` David Rientjes
  1 sibling, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2020-08-12 10:00 UTC (permalink / raw)
  To: Charan Teja Kalla, akpm, mhocko, vbabka, rientjes, linux-mm
  Cc: linux-kernel, vinmenon

On 12.08.20 11:46, Charan Teja Kalla wrote:
> 
> Thanks David for the inputs.
> 
> On 8/12/2020 2:35 AM, David Hildenbrand wrote:
[...]
>>
>> IIUC, this is not limited to the movable zone, it could also happen in
>> corner cases with the normal zone (e.g., hotplug to a node that only has
>> DMA memory, or no other memory yet).
> 
> Yes, this is my understanding too. I explained the above race in terms
> of just the movable zone, for which it was observed. We can add the below
> line at the end of the patch commit message:
> "This is not limited to the movable zone, it could also happen in cases
> with the normal zone (e.g., hotplug to a node that only has DMA memory,
> or no other memory yet)."

Yeah, that makes sense!

> 
> Just curious: do there exist systems where just a DMA zone is present
> and we hot add the normal zone? I am not aware of such a thing in the
> embedded world.

You can easily create such setups using QEMU.

IIRC, just specify a QEMU guest with 2G initial memory and a single NUMA
node, or 4G initial memory and two NUMA nodes. Then hotplug memory.

(IIRC kata containers always start a VM with 2G and then hotplug memory)

[...]
>> Fixes: and Cc: stable... tags?
> 
> Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into
> one-list-per-migrate-type")
> Cc: <stable@vger.kernel.org> [2.6+]

Did we have memory hotplug support then already?

> 
> I am not sure if I should raise a V3 including these?


Maybe Andrew can fixup when applying.


-- 
Thanks,

David / dhildenb



* Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
  2020-08-12 10:00     ` David Hildenbrand
@ 2020-08-12 10:11       ` Charan Teja Kalla
  2020-08-13  9:32         ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Charan Teja Kalla @ 2020-08-12 10:11 UTC (permalink / raw)
  To: David Hildenbrand, akpm, mhocko, vbabka, rientjes, linux-mm
  Cc: linux-kernel, vinmenon



On 8/12/2020 3:30 PM, David Hildenbrand wrote:
[...]
>> Just curious: do there exist systems where just a DMA zone is present
>> and we hot add the normal zone? I am not aware of such a thing in the
>> embedded world.
> 
> You can easily create such setups using QEMU.
> 
> IIRC, just specify a QEMU guest with 2G initial memory and a single NUMA
> node, or 4G initial memory and two NUMA nodes. Then hotplug memory.
> 
> (IIRC kata containers always start a VM with 2G and then hotplug memory)
>
I see. Thanks for letting me know this.

[...]
>> Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into
>> one-list-per-migrate-type")
>> Cc: <stable@vger.kernel.org> [2.6+]
> 
> Did we have memory hotplug support then already?

Yes, it did exist:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/mm/memory_hotplug.c?h=v2.6.39

> 
>>
>> I am not sure if I should raise a V3 including these?
> 
> 
> Maybe Andrew can fixup when applying.

Okay, let Andrew decide on this. Meanwhile, if you find that this patch
looks correct, an ACK from you helps here.

> 
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


* Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
  2020-08-12  9:46   ` Charan Teja Kalla
  2020-08-12 10:00     ` David Hildenbrand
@ 2020-08-12 18:53     ` David Rientjes
  1 sibling, 0 replies; 12+ messages in thread
From: David Rientjes @ 2020-08-12 18:53 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: David Hildenbrand, akpm, mhocko, vbabka, linux-mm, linux-kernel,
	vinmenon

On Wed, 12 Aug 2020, Charan Teja Kalla wrote:

[...]
> Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into
> one-list-per-migrate-type")
> Cc: <stable@vger.kernel.org> [2.6+]
> 

Acked-by: David Rientjes <rientjes@google.com>


* Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
  2020-08-12 10:11       ` Charan Teja Kalla
@ 2020-08-13  9:32         ` David Hildenbrand
  0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2020-08-13  9:32 UTC (permalink / raw)
  To: Charan Teja Kalla, akpm, mhocko, vbabka, rientjes, linux-mm
  Cc: linux-kernel, vinmenon

On 12.08.20 12:11, Charan Teja Kalla wrote:
[...]
>>> Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into
>>> one-list-per-migrate-type")
>>> Cc: <stable@vger.kernel.org> [2.6+]
>>
>> Did we have memory hotplug support then already?
> 
> Yes, it did exist:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/mm/memory_hotplug.c?h=v2.6.39

Okay, so I guess these tags make sense.

> 
>>
>>>
>>> I am not sure if I should raise a V3 including these?
>>
>>
>> Maybe Andrew can fixup when applying.
> 
> Okay, let Andrew decide on this. Meanwhile, if you find that this patch
> looks correct, an ACK from you helps here.


Sure, I think this is good enough as a simple fix.

Acked-by: David Hildenbrand <david@redhat.com>


-- 
Thanks,

David / dhildenb



* Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
  2020-08-11 12:58 [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk() Charan Teja Reddy
  2020-08-11 21:05 ` David Hildenbrand
@ 2020-08-13 11:41 ` Michal Hocko
  2020-08-13 16:21   ` Charan Teja Kalla
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2020-08-13 11:41 UTC (permalink / raw)
  To: Charan Teja Reddy
  Cc: akpm, vbabka, david, rientjes, linux-mm, linux-kernel, vinmenon

On Tue 11-08-20 18:28:23, Charan Teja Reddy wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e4896e6..839039f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	struct page *page, *tmp;
>  	LIST_HEAD(head);
>  
> +	/*
> +	 * Ensure a proper count is passed which otherwise would get stuck
> +	 * in the below while (list_empty(list)) loop.
> +	 */
> +	count = min(pcp->count, count);
>  	while (count) {
>  		struct list_head *list;


How does this prevent the race actually? Don't we need something like
the following instead?
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e028b87ce294..45bcc7ba37c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1317,9 +1317,16 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		 * lists
 		 */
 		do {
+			bool looped = false;
+
 			batch_free++;
-			if (++migratetype == MIGRATE_PCPTYPES)
+			if (++migratetype == MIGRATE_PCPTYPES) {
+				if (looped)
+					goto free;
+
 				migratetype = 0;
+				looped = true;
+			}
 			list = &pcp->lists[migratetype];
 		} while (list_empty(list));
 
@@ -1352,6 +1359,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (--count && --batch_free && !list_empty(list));
 	}
 
+free:
 	spin_lock(&zone->lock);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
  2020-08-13 11:41 ` Michal Hocko
@ 2020-08-13 16:21   ` Charan Teja Kalla
  2020-08-13 16:30     ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Charan Teja Kalla @ 2020-08-13 16:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, vbabka, david, rientjes, linux-mm, linux-kernel, vinmenon

Thanks Michal for comments.

On 8/13/2020 5:11 PM, Michal Hocko wrote:
> On Tue 11-08-20 18:28:23, Charan Teja Reddy wrote:
> [...]
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e4896e6..839039f 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  	struct page *page, *tmp;
>>  	LIST_HEAD(head);
>>  
>> +	/*
>> +	 * Ensure a proper count is passed which otherwise would get stuck
>> +	 * in the below while (list_empty(list)) loop.
>> +	 */
>> +	count = min(pcp->count, count);
>>  	while (count) {
>>  		struct list_head *list;
> 
> 
> How does this prevent the race actually?

This doesn't prevent the race. It only fixes the core hang (as this is
called with spin_lock_irq()) caused by the race condition. This core
hang happens because an incorrect count value is passed to the
free_pcppages_bulk() function.

The actual race should be fixed by David's suggestion (isolate, online
and undo isolation).

Does something in the commit message need to be improved? Maybe like
below:
s/The following race is observed with repeated .../A CPU core hang is
observed as a result of a race with the use case of repeated.../

> Don't we need something like
> the following instead?
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e028b87ce294..45bcc7ba37c4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1317,9 +1317,16 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  		 * lists
>  		 */
>  		do {
> +			bool looped = false;

IIUC, this 'looped' will always be initialized to false, thus we never
jump to 'free'.
But I think I got your idea: loop over the pcp lists looking for any
pages, and if none are found despite traversing all MIGRATE_PCPTYPES
lists, just break out (a corrected sketch is below, after the quoted
diff). Is this check really required? Shouldn't pcp->count tell us the
same thing, i.e., whether any pages are left in the pcp lists?

> +
>  			batch_free++;
> -			if (++migratetype == MIGRATE_PCPTYPES)
> +			if (++migratetype == MIGRATE_PCPTYPES) {
> +				if (looped)
> +					goto free;
> +
>  				migratetype = 0;
> +				looped = true;
> +			}
>  			list = &pcp->lists[migratetype];
>  		} while (list_empty(list));
>  
> @@ -1352,6 +1359,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  		} while (--count && --batch_free && !list_empty(list));
>  	}
>  
> +free:
>  	spin_lock(&zone->lock);
>  	isolated_pageblocks = has_isolate_pageblock(zone);
>  
> 
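
To be explicit, for the above idea to work, 'looped' would need to be
hoisted out of the do/while; an untested sketch of that (inside the
outer while (count) loop, with the 'free:' label from your diff):

	bool looped = false;

	do {
		batch_free++;
		if (++migratetype == MIGRATE_PCPTYPES) {
			/* every pcp list has been seen empty once: bail out */
			if (looped)
				goto free;
			migratetype = 0;
			looped = true;
		}
		list = &pcp->lists[migratetype];
	} while (list_empty(list));

That said, as noted above, pcp->count should already tell us whether
any pages are left, which is what the min() in the patch relies on.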

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


* Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
  2020-08-13 16:21   ` Charan Teja Kalla
@ 2020-08-13 16:30     ` Michal Hocko
  2020-08-13 17:27       ` Charan Teja Kalla
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2020-08-13 16:30 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: akpm, vbabka, david, rientjes, linux-mm, linux-kernel, vinmenon

On Thu 13-08-20 21:51:29, Charan Teja Kalla wrote:
> Thanks Michal for comments.
> 
[...]
> > How does this prevent the race actually?
> 
> This doesn't prevent the race. It only fixes the core hang (as this is
> called with spin_lock_irq()) caused by the race condition. This core
> hang happens because an incorrect count value is passed to the
> free_pcppages_bulk() function.

Let me ask differently: what enforces that the count and the lists do
not get out of sync in the loop? Your changelog says that the fix is to
use the proper value, without any specifics.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
  2020-08-13 16:30     ` Michal Hocko
@ 2020-08-13 17:27       ` Charan Teja Kalla
  2020-08-14  6:39         ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Charan Teja Kalla @ 2020-08-13 17:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, vbabka, david, rientjes, linux-mm, linux-kernel, vinmenon

Thanks Michal.

On 8/13/2020 10:00 PM, Michal Hocko wrote:
[...]
> 
> Let me ask differently: what enforces that the count and the lists do
> not get out of sync in the loop?

The count value is updated whenever an order-0 page is added to the pcp
lists through free_unref_page_commit(), which is called with both
interrupts and preemption disabled:

static void free_unref_page_commit(struct page *page, unsigned long pfn)
{
	....
	list_add(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
}

As these are pcp lists, they only get touched by another process when
this process is context switched, which happens only after enabling
preemption or interrupts. So, as long as a process is operating on these
pcp lists in free_unref_page_commit(), the count and the lists are
always in sync.
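
For reference, the caller looks roughly like this (a paraphrased sketch
of free_unref_page() from mm/page_alloc.c around this time):

void free_unref_page(struct page *page)
{
	unsigned long flags;
	unsigned long pfn = page_to_pfn(page);

	if (!free_unref_page_prepare(page, pfn))
		return;

	/* interrupts stay off while the pcp list and count are updated */
	local_irq_save(flags);
	free_unref_page_commit(page, pfn);
	local_irq_restore(flags);
}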

However, the problem here is not that the count and lists get out of
sync. They are always in sync, as explained above. It is that
free_pcppages_bulk() is asked to free more pages than are present in
the pcp lists, which ends up spinning in while (list_empty()).

> Your changelog says that the fix is to
> use the proper value, without any specifics.
> 
Will change this to: "Ensure the count value passed is not greater than
the pcp lists count." Do you suggest anything better?

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


* Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
  2020-08-13 17:27       ` Charan Teja Kalla
@ 2020-08-14  6:39         ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2020-08-14  6:39 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: akpm, vbabka, david, rientjes, linux-mm, linux-kernel, vinmenon

On Thu 13-08-20 22:57:32, Charan Teja Kalla wrote:
[...]
> However, the problem here is not that the count and lists get out of
> sync. They are always in sync, as explained above. It is that
> free_pcppages_bulk() is asked to free more pages than are present in
> the pcp lists, which ends up spinning in while (list_empty()).

You are right. I managed to confuse myself. The thing is that the batch
count is out of sync.

> > Your changelog says that the fix is to
> > use the proper value, without any specifics.
> > 
> Will change this to: "Ensure the count value passed is not greater than
> the pcp lists count." Do you suggest anything better?

Yes, this makes it more clear.

Feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs

