linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 1/2] mm/page_alloc: drop the unnecessary pfn_valid() for start pfn
From: Baolin Wang @ 2023-04-24 13:45 UTC (permalink / raw)
  To: akpm
  Cc: rppt, ying.huang, mgorman, vbabka, mhocko, david, baolin.wang,
	linux-mm, linux-kernel

__pageblock_pfn_to_page() currently performs both a pfn_valid() check
and pfn_to_online_page(). The former is redundant because the latter is
a stronger check. Drop the pfn_valid() check for the start pfn.
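
For reference, here is a minimal sketch of why the online check subsumes
the validity check. It is illustrative only, not the exact
mm/memory_hotplug.c implementation; the helper names follow
include/linux/mmzone.h:

/*
 * Sketch: pfn_to_online_page() refuses any pfn whose section is not
 * online, and it also validates the pfn itself, so a separate
 * pfn_valid() call on the same pfn adds nothing.
 */
static struct page *pfn_to_online_page_sketch(unsigned long pfn)
{
	unsigned long nr = pfn_to_section_nr(pfn);

	if (nr >= NR_MEM_SECTIONS || !online_section_nr(nr))
		return NULL;		/* absent or offline section */
	if (!pfn_valid(pfn))
		return NULL;		/* sub-section hole */
	return pfn_to_page(pfn);
}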

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
Changes from v2:
 - Update the commit message per Michal. Thanks.
 - Add acked tag from Michal.
Changes from v1:
 - Collect reviewed tags. Thanks David and Ying.
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9de2a18519a1..6457b64fe562 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1512,7 +1512,7 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
 	/* end_pfn is one past the range we are checking */
 	end_pfn--;
 
-	if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
+	if (!pfn_valid(end_pfn))
 		return NULL;
 
 	start_page = pfn_to_online_page(start_pfn);
-- 
2.27.0



* [PATCH v3 2/2] mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
From: Baolin Wang @ 2023-04-24 13:45 UTC (permalink / raw)
  To: akpm
  Cc: rppt, ying.huang, mgorman, vbabka, mhocko, david, baolin.wang,
	linux-mm, linux-kernel

Currently, __pageblock_pfn_to_page() is used by set_zone_contiguous(),
which checks whether the given zone contains holes. It uses
pfn_to_online_page() to validate that the start pfn is online and valid,
and pfn_valid() to validate the end pfn.
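
For context, this is roughly how the caller walks the zone pageblock by
pageblock (a simplified sketch of set_zone_contiguous() from
mm/page_alloc.c of this era; details may differ across kernel versions):

void set_zone_contiguous(struct zone *zone)
{
	unsigned long block_start_pfn = zone->zone_start_pfn;
	unsigned long block_end_pfn = pageblock_end_pfn(block_start_pfn);

	for (; block_start_pfn < zone_end_pfn(zone);
	     block_start_pfn = block_end_pfn,
	     block_end_pfn += pageblock_nr_pages) {
		block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));

		/* A pageblock failing the check means a hole. */
		if (!__pageblock_pfn_to_page(block_start_pfn,
					     block_end_pfn, zone))
			return;
		cond_resched();
	}

	/* No hole found in the whole zone span. */
	zone->contiguous = true;
}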

However, __pageblock_pfn_to_page() may return non-NULL even if the end
pfn of a pageblock is in a memory hole in some situations. For example,
a pageblock of order MAX_ORDER spans 2 sub-sections, and the end pfn of
the pageblock may be in a hole even though the start pfn is online and
valid.

See the memory layout below as an example, and suppose the pageblock
order is MAX_ORDER.

[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000040000000-0x00000000ffffffff]
[    0.000000]   DMA32    empty
[    0.000000]   Normal   [mem 0x0000000100000000-0x0000001fa7ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000040000000-0x0000001fa3c7ffff]
[    0.000000]   node   0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
[    0.000000]   node   0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
[    0.000000]   node   0: [mem 0x0000001fa4030000-0x0000001fa40effff]
[    0.000000]   node   0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
[    0.000000]   node   0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
[    0.000000]   node   0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
[    0.000000]   node   0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
[    0.000000]   node   0: [mem 0x0000001fa7590000-0x0000001fa7dfffff]

Focus on the last memory range [mem 0x0000001fa7590000-0x0000001fa7dfffff]:
present memory ends at 0x1fa7dfffff, but the last pageblock covers the
range from 0x1fa7c00000 to 0x1fa7ffffff, since a pageblock must be 4M
aligned. So the tail of this pageblock is a hole, and its pfns fall into
2 sub-sections (the sub-section size is 2M).

So the 1st sub-section (covering the address range 0x1fa7c00000 -
0x1fa7dfffff) in this pageblock is marked valid by subsection_map_init()
called from free_area_init(), but the 2nd sub-section (covering the
address range 0x1fa7e00000 - 0x1fa7ffffff) in this pageblock is not
valid.
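
To make the alignment arithmetic concrete, here is a small illustrative
sketch (assuming 4K pages, so the MAX_ORDER pageblock is 4M and a
sub-section is 2M; the macro names are made up for illustration, not
kernel API):

#define PAGEBLOCK_BYTES		(4UL << 20)	/* MAX_ORDER pageblock: 4M */
#define SUBSECTION_BYTES	(2UL << 20)	/* sub-section: 2M */

unsigned long last_present = 0x1fa7dfffffUL;	/* end of last memory range */
unsigned long pb_start = last_present & ~(PAGEBLOCK_BYTES - 1); /* 0x1fa7c00000 */
unsigned long pb_end = pb_start + PAGEBLOCK_BYTES - 1;		/* 0x1fa7ffffff */
unsigned long sub2 = pb_start + SUBSECTION_BYTES;		/* 0x1fa7e00000 */

/*
 * [pb_start, sub2 - 1]: 1st sub-section, backed by memory -> marked valid
 * [sub2, pb_end]:       2nd sub-section, a memory hole    -> not valid
 */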

This has not broken anything so far, but the zone contiguity detection
is fragile in this scenario. So, as discussed previously[1], it is
better to add some comments to explain this possible issue, in case
future pfn walkers come to rely on it.

[1] https://lore.kernel.org/all/87r0sdsmr6.fsf@yhuang6-desk2.ccr.corp.intel.com/

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Changes from v2:
 - Update the commit log and comments per Michal, thanks.
Changes from v1:
 - Update the comments per Ying and Mike, thanks.

Note, I did not add Huang Ying's reviewed tag, since there are some
updates per Michal's suggestion. Ying, please review the v3. Thanks.
---
 mm/page_alloc.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6457b64fe562..bd124390c79b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1502,6 +1502,15 @@ void __free_pages_core(struct page *page, unsigned int order)
  * interleaving within a single pageblock. It is therefore sufficient to check
  * the first and last page of a pageblock and avoid checking each individual
  * page in a pageblock.
+ *
+ * Note: the function may return non-NULL struct page even for a page block
+ * which contains a memory hole (i.e. there is no physical memory for a subset
+ * of the pfn range). For example, if the pageblock order is MAX_ORDER, which
+ * will fall into 2 sub-sections, and the end pfn of the pageblock may be hole
+ * even though the start pfn is online and valid. This should be safe most of
+ * the time because struct pages are still zero pre-filled and pfn walkers
+ * shouldn't touch any physical memory range for which they do not recognize
+ * any specific metadata in struct pages.
  */
 struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
 				     unsigned long end_pfn, struct zone *zone)
-- 
2.27.0



* Re: [PATCH v3 2/2] mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
From: Michal Hocko @ 2023-04-24 14:47 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, rppt, ying.huang, mgorman, vbabka, david, linux-mm, linux-kernel

On Mon 24-04-23 21:45:40, Baolin Wang wrote:
[...]
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Acked-by: Michal Hocko <mhocko@suse.com>

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3 2/2] mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
From: Huang, Ying @ 2023-04-25  0:22 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, rppt, mgorman, vbabka, mhocko, david, linux-mm, linux-kernel

Baolin Wang <baolin.wang@linux.alibaba.com> writes:

[...]
> @@ -1502,6 +1502,15 @@ void __free_pages_core(struct page *page, unsigned int order)
>   * interleaving within a single pageblock. It is therefore sufficient to check
>   * the first and last page of a pageblock and avoid checking each individual
>   * page in a pageblock.
> + *
> + * Note: the function may return non-NULL struct page even for a page block
> + * which contains a memory hole (i.e. there is no physical memory for a subset
> + * of the pfn range). For example, if the pageblock order is MAX_ORDER, which
> + * will fall into 2 sub-sections, and the end pfn of the pageblock may be hole
> + * even though the start pfn is online and valid. This should be safe most of
> + * the time because struct pages are still zero pre-filled and pfn walkers

I don't think the struct page is just zero-filled even if it's in a hole.
Can you confirm that?  In memmap_init() and memmap_init_zone_range(),
init_unavailable_range() is called to initialize the struct page.

Best Regards,
Huang, Ying

> + * shouldn't touch any physical memory range for which they do not recognize
> + * any specific metadata in struct pages.
>   */
>  struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
>  				     unsigned long end_pfn, struct zone *zone)


* Re: [PATCH v3 2/2] mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
From: Baolin Wang @ 2023-04-25  1:27 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, rppt, mgorman, vbabka, mhocko, david, linux-mm, linux-kernel



On 4/25/2023 8:22 AM, Huang, Ying wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
[...]
>> @@ -1502,6 +1502,15 @@ void __free_pages_core(struct page *page, unsigned int order)
>>    * interleaving within a single pageblock. It is therefore sufficient to check
>>    * the first and last page of a pageblock and avoid checking each individual
>>    * page in a pageblock.
>> + *
>> + * Note: the function may return non-NULL struct page even for a page block
>> + * which contains a memory hole (i.e. there is no physical memory for a subset
>> + * of the pfn range). For example, if the pageblock order is MAX_ORDER, which
>> + * will fall into 2 sub-sections, and the end pfn of the pageblock may be hole
>> + * even though the start pfn is online and valid. This should be safe most of
>> + * the time because struct pages are still zero pre-filled and pfn walkers
> 
> I don't think the struct page is just zero-filled even if it's in a hole.
> Can you confirm that?  In memmap_init() and memmap_init_zone_range(),
> init_unavailable_range() is called to initialize the struct page.

Yes, what I mean is that the page frames are first zero-initialized, and
some fields are then set to default values. The "zero pre-filled" wording
seems confusing; maybe change it to "initialized"?


* Re: [PATCH v3 2/2] mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
From: Huang, Ying @ 2023-04-25  2:30 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, rppt, mgorman, vbabka, mhocko, david, linux-mm, linux-kernel

Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 4/25/2023 8:22 AM, Huang, Ying wrote:
>> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>> 
[...]
>>> @@ -1502,6 +1502,15 @@ void __free_pages_core(struct page *page, unsigned int order)
>>>    * interleaving within a single pageblock. It is therefore sufficient to check
>>>    * the first and last page of a pageblock and avoid checking each individual
>>>    * page in a pageblock.
>>> + *
>>> + * Note: the function may return non-NULL struct page even for a page block
>>> + * which contains a memory hole (i.e. there is no physical memory for a subset
>>> + * of the pfn range). For example, if the pageblock order is MAX_ORDER, which
>>> + * will fall into 2 sub-sections, and the end pfn of the pageblock may be hole
>>> + * even though the start pfn is online and valid. This should be safe most of
>>> + * the time because struct pages are still zero pre-filled and pfn walkers
>> I don't think the struct page is just zero-filled even if it's in a hole.
>> Can you confirm that?  In memmap_init() and memmap_init_zone_range(),
>> init_unavailable_range() is called to initialize the struct page.
>
> Yes, what I mean is that the page frames are first zero-initialized, and
> some fields are then set to default values. The "zero pre-filled" wording
> seems confusing; maybe change it to "initialized"?

Yes.  That sounds good.

Best Regards,
Huang, Ying


* Re: [PATCH v3 2/2] mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
From: Michal Hocko @ 2023-04-25  9:05 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Huang, Ying, akpm, rppt, mgorman, vbabka, david, linux-mm, linux-kernel

On Tue 25-04-23 09:27:23, Baolin Wang wrote:
> 
> 
> On 4/25/2023 8:22 AM, Huang, Ying wrote:
> > Baolin Wang <baolin.wang@linux.alibaba.com> writes:
[...]
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 6457b64fe562..bd124390c79b 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1502,6 +1502,15 @@ void __free_pages_core(struct page *page, unsigned int order)
> > >    * interleaving within a single pageblock. It is therefore sufficient to check
> > >    * the first and last page of a pageblock and avoid checking each individual
> > >    * page in a pageblock.
> > > + *
> > > + * Note: the function may return non-NULL struct page even for a page block
> > > + * which contains a memory hole (i.e. there is no physical memory for a subset
> > > + * of the pfn range). For example, if the pageblock order is MAX_ORDER, which
> > > + * will fall into 2 sub-sections, and the end pfn of the pageblock may be hole
> > > + * even though the start pfn is online and valid. This should be safe most of
> > > + * the time because struct pages are still zero pre-filled and pfn walkers
> > 
> > I don't think the struct page is just zero-filled even if it's in a hole.
> > Can you confirm that?  In memmap_init() and memmap_init_zone_range(),
> > init_unavailable_range() is called to initialize the struct page.
> 
> Yes, what I mean is that the page frames are first zero-initialized, and
> some fields are then set to default values. The "zero pre-filled" wording
> seems confusing; maybe change it to "initialized"?

Huang Ying is correct. Holes should have struct pages initialized and
init_unavailable_range actually marks those pages reserved. Which
is really good because they mean "do not touch unless this page is
yours". For some reason I thought those struct pages are simply zero
filled. I was clearly wrong. Maybe it would be good to reference
init_unavailable_range in the comment so that it is easier to track the
whole code path.
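
For easier tracking of that code path, an abridged sketch of
init_unavailable_range() (simplified from the kernel of this era; exact
details vary by version):

static void __init init_unavailable_range(unsigned long spfn,
					  unsigned long epfn,
					  int zone, int node)
{
	unsigned long pfn;

	for (pfn = spfn; pfn < epfn; pfn++) {
		struct page *page = pfn_to_page(pfn);

		/* Full struct page init, not just a zeroed memmap... */
		__init_single_page(page, pfn, zone, node);
		/* ...plus PG_reserved: "do not touch unless yours". */
		__SetPageReserved(page);
	}
}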

Sorry about that!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3 2/2] mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
From: Baolin Wang @ 2023-04-25 12:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Huang, Ying, akpm, rppt, mgorman, vbabka, david, linux-mm, linux-kernel



On 4/25/2023 5:05 PM, Michal Hocko wrote:
> On Tue 25-04-23 09:27:23, Baolin Wang wrote:
>>
>>
>> On 4/25/2023 8:22 AM, Huang, Ying wrote:
>>> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> [...]
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 6457b64fe562..bd124390c79b 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -1502,6 +1502,15 @@ void __free_pages_core(struct page *page, unsigned int order)
>>>>     * interleaving within a single pageblock. It is therefore sufficient to check
>>>>     * the first and last page of a pageblock and avoid checking each individual
>>>>     * page in a pageblock.
>>>> + *
>>>> + * Note: the function may return non-NULL struct page even for a page block
>>>> + * which contains a memory hole (i.e. there is no physical memory for a subset
>>>> + * of the pfn range). For example, if the pageblock order is MAX_ORDER, which
>>>> + * will fall into 2 sub-sections, and the end pfn of the pageblock may be hole
>>>> + * even though the start pfn is online and valid. This should be safe most of
>>>> + * the time because struct pages are still zero pre-filled and pfn walkers
>>>
>>> I don't think the struct page is just zero-filled even if it's in a hole.
>>> Can you confirm that?  In memmap_init() and memmap_init_zone_range(),
>>> init_unavailable_range() is called to initialize the struct page.
>>
>> Yes, what I mean is that the page frames are first zero-initialized, and
>> some fields are then set to default values. The "zero pre-filled" wording
>> seems confusing; maybe change it to "initialized"?
> 
> Huang Ying is correct. Holes should have struct pages initialized and
> init_unavailable_range actually marks those pages reserved. Which
> is really good because they mean "do not touch unless this page is
> yours". For some reason I thought those struct pages are simply zero
> filled. I was clearly wrong. Maybe it would be good to reference
> init_unavailable_range in the comment so that it is easier to track the
> whole code path.

OK, will do as you and Huang Ying suggested. Thank you both.

> Sorry about that!

never mind:)


* [PATCH v4] mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
From: Baolin Wang @ 2023-04-25 12:44 UTC (permalink / raw)
  To: akpm
  Cc: rppt, ying.huang, mgorman, vbabka, mhocko, david, baolin.wang,
	linux-mm, linux-kernel

Currently, __pageblock_pfn_to_page() is used by set_zone_contiguous(),
which checks whether the given zone contains holes. It uses
pfn_to_online_page() to validate that the start pfn is online and valid,
and pfn_valid() to validate the end pfn.

However, __pageblock_pfn_to_page() may return non-NULL even if the end
pfn of a pageblock is in a memory hole in some situations. For example,
a pageblock of order MAX_ORDER spans 2 sub-sections, and the end pfn of
the pageblock may be in a hole even though the start pfn is online and
valid.

See the memory layout below as an example, and suppose the pageblock
order is MAX_ORDER.

[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000040000000-0x00000000ffffffff]
[    0.000000]   DMA32    empty
[    0.000000]   Normal   [mem 0x0000000100000000-0x0000001fa7ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000040000000-0x0000001fa3c7ffff]
[    0.000000]   node   0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
[    0.000000]   node   0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
[    0.000000]   node   0: [mem 0x0000001fa4030000-0x0000001fa40effff]
[    0.000000]   node   0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
[    0.000000]   node   0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
[    0.000000]   node   0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
[    0.000000]   node   0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
[    0.000000]   node   0: [mem 0x0000001fa7590000-0x0000001fa7dfffff]

Focus on the last memory range [mem 0x0000001fa7590000-0x0000001fa7dfffff]:
present memory ends at 0x1fa7dfffff, but the last pageblock covers the
range from 0x1fa7c00000 to 0x1fa7ffffff, since a pageblock must be 4M
aligned. So the tail of this pageblock is a hole, and its pfns fall into
2 sub-sections (the sub-section size is 2M).

So the 1st sub-section (covering the address range 0x1fa7c00000 -
0x1fa7dfffff) in this pageblock is marked valid by subsection_map_init()
called from free_area_init(), but the 2nd sub-section (covering the
address range 0x1fa7e00000 - 0x1fa7ffffff) in this pageblock is not
valid.

This has not broken anything so far, but the zone contiguity detection
is fragile in this scenario. So, as discussed previously[1], it is
better to add some comments to explain this possible issue, in case
future pfn walkers come to rely on it.

[1] https://lore.kernel.org/all/87r0sdsmr6.fsf@yhuang6-desk2.ccr.corp.intel.com/

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
Changes from v3:
 - Update the comments to make it clear.
 - Add acked tag from Michal.
Changes from v2:
 - Update the commit log and comments per Michal, thanks.
Changes from v1:
 - Update the comments per Ying and Mike, thanks.
---
 mm/page_alloc.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6457b64fe562..af9c995d3c1e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1502,6 +1502,15 @@ void __free_pages_core(struct page *page, unsigned int order)
  * interleaving within a single pageblock. It is therefore sufficient to check
  * the first and last page of a pageblock and avoid checking each individual
  * page in a pageblock.
+ *
+ * Note: the function may return non-NULL struct page even for a page block
+ * which contains a memory hole (i.e. there is no physical memory for a subset
+ * of the pfn range). For example, a pageblock of order MAX_ORDER spans 2
+ * sub-sections, and the end pfn of the pageblock may be in a hole even though
+ * the start pfn is online and valid. This should be safe most of the time
+ * because struct pages are still initialized via init_unavailable_range() and
+ * pfn walkers shouldn't touch any physical memory range for which they do not
+ * recognize any specific metadata in struct pages.
  */
 struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
 				     unsigned long end_pfn, struct zone *zone)
-- 
2.27.0



* Re: [PATCH v4] mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
From: Huang, Ying @ 2023-04-26  1:23 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, rppt, mgorman, vbabka, mhocko, david, linux-mm, linux-kernel

Baolin Wang <baolin.wang@linux.alibaba.com> writes:

[...]
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>


