* [RFC PATCH] arm64: mm: Fix kernel page tables incorrectly deleted during memory removal
@ 2023-07-17 11:51 ` Wupeng Ma
  0 siblings, 0 replies; 13+ messages in thread
From: Wupeng Ma @ 2023-07-17 11:51 UTC (permalink / raw)
  To: catalin.marinas, akpm, sudaraja
  Cc: linux-mm, linux-kernel, mawupeng1, wangkefeng.wang, will,
	linux-arm-kernel, mark.rutland, anshuman.khandual

From: Ma Wupeng <mawupeng1@huawei.com>

During our testing, we found that the kernel page table may be unexpectedly
cleared with rodata off. The root cause is that the kernel linear map is
initialized with PUD-size (1G) block mappings, while offlining works at
memory block size granularity (MIN_MEMORY_BLOCK_SIZE, 128M). For example,
if 2G of memory is hot-added and a memory block is then offlined, the call
trace is shown below:

 offline_and_remove_memory
    try_remove_memory
      arch_remove_memory
       __remove_pgd_mapping
         unmap_hotplug_range
           unmap_hotplug_p4d_range
             unmap_hotplug_pud_range
               if (pud_sect(pud))
                 pud_clear(pudp);

There is no issue for PMD-level (2M) block mappings because the memory
block size is aligned to 2M.

Commit f0b13ee23241 ("arm64/sparsemem: reduce SECTION_SIZE_BITS") reduced
SECTION_SIZE_BITS on arm64, which makes it possible for the memory section
size to be smaller than the PUD size. Since only hot-added memory can be
removed on arm64 due to commit bbd6ec605c0f ("arm64/mm: Enable memory hot
remove"), avoiding PUD-size kernel page table entries during memory hot-add
fixes this.

Fixes: f0b13ee23241 ("arm64/sparsemem: reduce SECTION_SIZE_BITS")
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
---
 arch/arm64/mm/mmu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 95d360805f8a..44c724ce4f70 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -44,6 +44,7 @@
 #define NO_BLOCK_MAPPINGS	BIT(0)
 #define NO_CONT_MAPPINGS	BIT(1)
 #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
+#define NO_PUD_MAPPINGS		BIT(3)
 
 int idmap_t0sz __ro_after_init;
 
@@ -344,7 +345,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
 		 */
 		if (pud_sect_supported() &&
 		   ((addr | next | phys) & ~PUD_MASK) == 0 &&
-		    (flags & NO_BLOCK_MAPPINGS) == 0) {
+		    (flags & (NO_BLOCK_MAPPINGS | NO_PUD_MAPPINGS)) == 0) {
 			pud_set_huge(pudp, phys, prot);
 
 			/*
@@ -1305,7 +1306,7 @@ struct range arch_get_mappable_range(void)
 int arch_add_memory(int nid, u64 start, u64 size,
 		    struct mhp_params *params)
 {
-	int ret, flags = NO_EXEC_MAPPINGS;
+	int ret, flags = NO_EXEC_MAPPINGS | NO_PUD_MAPPINGS;
 
 	VM_BUG_ON(!mhp_range_allowed(start, size, true));
 
-- 
2.25.1




* Re: [RFC PATCH] arm64: mm: Fix kernel page tables incorrectly deleted during memory removal
  2023-07-17 11:51 ` Wupeng Ma
@ 2023-07-21 10:36   ` Will Deacon
  -1 siblings, 0 replies; 13+ messages in thread
From: Will Deacon @ 2023-07-21 10:36 UTC (permalink / raw)
  To: Wupeng Ma
  Cc: catalin.marinas, akpm, sudaraja, linux-mm, linux-kernel,
	wangkefeng.wang, linux-arm-kernel, mark.rutland,
	anshuman.khandual

On Mon, Jul 17, 2023 at 07:51:50PM +0800, Wupeng Ma wrote:
> From: Ma Wupeng <mawupeng1@huawei.com>
> 
> During our test, we found that kernel page table may be unexpectedly
> cleared with rodata off. The root cause is that the kernel page is
> initialized with pud size(1G block mapping) while offline is memory
> block size(MIN_MEMORY_BLOCK_SIZE 128M), eg, if 2G memory is hot-added,
> when offline a memory block, the call trace is shown below,
> 
>  offline_and_remove_memory
>     try_remove_memory
>       arch_remove_memory
>        __remove_pgd_mapping
>          unmap_hotplug_range
>            unmap_hotplug_p4d_range
>              unmap_hotplug_pud_range
>                if (pud_sect(pud))
>                  pud_clear(pudp);

Sorry, but I'm struggling to understand the problem here. If we're adding
and removing a 2G memory region, why _wouldn't_ we want to use large 1GiB
mappings? Or are you saying that only a subset of the memory is removed,
but we then accidentally unmap the whole thing?

> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 95d360805f8a..44c724ce4f70 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -44,6 +44,7 @@
>  #define NO_BLOCK_MAPPINGS	BIT(0)
>  #define NO_CONT_MAPPINGS	BIT(1)
>  #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
> +#define NO_PUD_MAPPINGS		BIT(3)
>  
>  int idmap_t0sz __ro_after_init;
>  
> @@ -344,7 +345,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
>  		 */
>  		if (pud_sect_supported() &&
>  		   ((addr | next | phys) & ~PUD_MASK) == 0 &&
> -		    (flags & NO_BLOCK_MAPPINGS) == 0) {
> +		    (flags & (NO_BLOCK_MAPPINGS | NO_PUD_MAPPINGS)) == 0) {
>  			pud_set_huge(pudp, phys, prot);
>  
>  			/*
> @@ -1305,7 +1306,7 @@ struct range arch_get_mappable_range(void)
>  int arch_add_memory(int nid, u64 start, u64 size,
>  		    struct mhp_params *params)
>  {
> -	int ret, flags = NO_EXEC_MAPPINGS;
> +	int ret, flags = NO_EXEC_MAPPINGS | NO_PUD_MAPPINGS;

I think we should allow large mappings here and instead prevent partial
removal of the block, if that's what is causing the issue.

Will



* Re: [RFC PATCH] arm64: mm: Fix kernel page tables incorrectly deleted during memory removal
  2023-07-21 10:36   ` Will Deacon
@ 2023-07-24  1:25   ` mawupeng
  2023-07-24  5:54     ` Anshuman Khandual
  -1 siblings, 1 reply; 13+ messages in thread
From: mawupeng @ 2023-07-24  1:25 UTC (permalink / raw)
  To: will
  Cc: mawupeng1, catalin.marinas, akpm, sudaraja, linux-mm,
	linux-kernel, wangkefeng.wang, linux-arm-kernel, mark.rutland,
	anshuman.khandual



On 2023/7/21 18:36, Will Deacon wrote:
> On Mon, Jul 17, 2023 at 07:51:50PM +0800, Wupeng Ma wrote:
>> From: Ma Wupeng <mawupeng1@huawei.com>
>>
>> During our test, we found that kernel page table may be unexpectedly
>> cleared with rodata off. The root cause is that the kernel page is
>> initialized with pud size(1G block mapping) while offline is memory
>> block size(MIN_MEMORY_BLOCK_SIZE 128M), eg, if 2G memory is hot-added,
>> when offline a memory block, the call trace is shown below,
>>
>>  offline_and_remove_memory
>>     try_remove_memory
>>       arch_remove_memory
>>        __remove_pgd_mapping
>>          unmap_hotplug_range
>>            unmap_hotplug_p4d_range
>>              unmap_hotplug_pud_range
>>                if (pud_sect(pud))
>>                  pud_clear(pudp);
> 
> Sorry, but I'm struggling to understand the problem here. If we're adding
> and removing a 2G memory region, why _wouldn't_ we want to use large 1GiB
> mappings?


> Or are you saying that only a subset of the memory is removed,
> but we then accidentally unmap the whole thing?

Yes, we unmap only a subset, but the page table entry covering the whole range is removed.

> 
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 95d360805f8a..44c724ce4f70 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -44,6 +44,7 @@
>>  #define NO_BLOCK_MAPPINGS	BIT(0)
>>  #define NO_CONT_MAPPINGS	BIT(1)
>>  #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>> +#define NO_PUD_MAPPINGS		BIT(3)
>>  
>>  int idmap_t0sz __ro_after_init;
>>  
>> @@ -344,7 +345,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
>>  		 */
>>  		if (pud_sect_supported() &&
>>  		   ((addr | next | phys) & ~PUD_MASK) == 0 &&
>> -		    (flags & NO_BLOCK_MAPPINGS) == 0) {
>> +		    (flags & (NO_BLOCK_MAPPINGS | NO_PUD_MAPPINGS)) == 0) {
>>  			pud_set_huge(pudp, phys, prot);
>>  
>>  			/*
>> @@ -1305,7 +1306,7 @@ struct range arch_get_mappable_range(void)
>>  int arch_add_memory(int nid, u64 start, u64 size,
>>  		    struct mhp_params *params)
>>  {
>> -	int ret, flags = NO_EXEC_MAPPINGS;
>> +	int ret, flags = NO_EXEC_MAPPINGS | NO_PUD_MAPPINGS;
> 
> I think we should allow large mappings here and instead prevent partial
> removal of the block, if that's what is causing the issue.

This could solve the problem.
Alternatively, could we prevent partial removal? Or rebuild the page table entries for the ranges that were not removed?

> 
> Will


* Re: [RFC PATCH] arm64: mm: Fix kernel page tables incorrectly deleted during memory removal
  2023-07-24  1:25   ` mawupeng
@ 2023-07-24  5:54     ` Anshuman Khandual
  2023-07-24  6:11       ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Anshuman Khandual @ 2023-07-24  5:54 UTC (permalink / raw)
  To: mawupeng, will, David Hildenbrand
  Cc: catalin.marinas, akpm, sudaraja, linux-mm, linux-kernel,
	wangkefeng.wang, linux-arm-kernel, mark.rutland



On 7/24/23 06:55, mawupeng wrote:
> 
> On 2023/7/21 18:36, Will Deacon wrote:
>> On Mon, Jul 17, 2023 at 07:51:50PM +0800, Wupeng Ma wrote:
>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>
>>> During our test, we found that kernel page table may be unexpectedly
>>> cleared with rodata off. The root cause is that the kernel page is
>>> initialized with pud size(1G block mapping) while offline is memory
>>> block size(MIN_MEMORY_BLOCK_SIZE 128M), eg, if 2G memory is hot-added,
>>> when offline a memory block, the call trace is shown below,
>>>
>>>  offline_and_remove_memory
>>>     try_remove_memory
>>>       arch_remove_memory
>>>        __remove_pgd_mapping
>>>          unmap_hotplug_range
>>>            unmap_hotplug_p4d_range
>>>              unmap_hotplug_pud_range
>>>                if (pud_sect(pud))
>>>                  pud_clear(pudp);
>> Sorry, but I'm struggling to understand the problem here. If we're adding
>> and removing a 2G memory region, why _wouldn't_ we want to use large 1GiB
>> mappings?
> 
>> Or are you saying that only a subset of the memory is removed,
>> but we then accidentally unmap the whole thing?
> Yes, umap a subset but the whole thing page table entry is removed.
> 
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index 95d360805f8a..44c724ce4f70 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -44,6 +44,7 @@
>>>  #define NO_BLOCK_MAPPINGS	BIT(0)
>>>  #define NO_CONT_MAPPINGS	BIT(1)
>>>  #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>>> +#define NO_PUD_MAPPINGS		BIT(3)
>>>  
>>>  int idmap_t0sz __ro_after_init;
>>>  
>>> @@ -344,7 +345,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
>>>  		 */
>>>  		if (pud_sect_supported() &&
>>>  		   ((addr | next | phys) & ~PUD_MASK) == 0 &&
>>> -		    (flags & NO_BLOCK_MAPPINGS) == 0) {
>>> +		    (flags & (NO_BLOCK_MAPPINGS | NO_PUD_MAPPINGS)) == 0) {
>>>  			pud_set_huge(pudp, phys, prot);
>>>  
>>>  			/*
>>> @@ -1305,7 +1306,7 @@ struct range arch_get_mappable_range(void)
>>>  int arch_add_memory(int nid, u64 start, u64 size,
>>>  		    struct mhp_params *params)
>>>  {
>>> -	int ret, flags = NO_EXEC_MAPPINGS;
>>> +	int ret, flags = NO_EXEC_MAPPINGS | NO_PUD_MAPPINGS;
>> I think we should allow large mappings here and instead prevent partial
>> removal of the block, if that's what is causing the issue.
> This could solve this problem.
> Or we can prevent  partial removal? Or rebulid page table entry which is not removed?

+ David Hildenbrand <david@redhat.com>

Splitting the block mapping and rebuilding page table entries to reflect the non-removed
areas would require additional information, such as the flags and the pgtable alloc function
used in __create_pgd_mapping(), which would need to be passed along depending on whether
we are tearing down the vmemmap (which would not have a PUD block map) or the linear
mapping. But I am just wondering whether we have to go in that direction at all, or can
just prevent partial memory block removal as suggested by Will.

- arch_remove_memory() does not have a return type, so core MM hotremove would not fail
  because arch_remove_memory() failed or warned

- core MM hotremove does check_hotplug_memory_range() which ensures the range and
  start address are memory_block_size_bytes() aligned

- The default memory_block_size_bytes() depends on SECTION_SIZE_BITS, which on arm64
  can now be less than PUD_SIZE, triggering this problem.

	#define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)

	unsigned long __weak memory_block_size_bytes(void)
	{
        	return MIN_MEMORY_BLOCK_SIZE;
	}
	EXPORT_SYMBOL_GPL(memory_block_size_bytes);

- We would need to override memory_block_size_bytes() on arm64 to accommodate such
  scenarios here

Something like this might work (built but not tested)

commit 2eb8dc0d08dfe0b2a3bb71df93b12f7bf74a2ca6 (HEAD)
Author: Anshuman Khandual <anshuman.khandual@arm.com>
Date:   Mon Jul 24 06:45:34 2023 +0100

    arm64/mm: Define memory_block_size_bytes()
    
    Define memory_block_size_bytes() on arm64 platforms to set the minimum hot plug
    and remove granularity to PUD_SIZE in cases where MIN_MEMORY_BLOCK_SIZE falls
    below PUD_SIZE. Otherwise a complete PUD block mapping would be torn down while
    unmapping a MIN_MEMORY_BLOCK_SIZE range.
    
    Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 95d360805f8a..1918459b3460 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1157,6 +1157,17 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+unsigned long memory_block_size_bytes(void)
+{
+       /*
+        * Linear mappings might include PUD based block mappings, which
+        * cannot be torn down partially during memory hotremove. Hence
+        * PUD_SIZE needs to be the minimum hot removal granularity in
+        * case MIN_MEMORY_BLOCK_SIZE falls below it.
+        */
+       return max_t(unsigned long, MIN_MEMORY_BLOCK_SIZE, PUD_SIZE);
+}
+
 void vmemmap_free(unsigned long start, unsigned long end,
                struct vmem_altmap *altmap)
 {


* Re: [RFC PATCH] arm64: mm: Fix kernel page tables incorrectly deleted during memory removal
  2023-07-24  5:54     ` Anshuman Khandual
@ 2023-07-24  6:11       ` David Hildenbrand
  2023-07-26  6:20         ` mawupeng
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2023-07-24  6:11 UTC (permalink / raw)
  To: Anshuman Khandual, mawupeng, will
  Cc: catalin.marinas, akpm, sudaraja, linux-mm, linux-kernel,
	wangkefeng.wang, linux-arm-kernel, mark.rutland

On 24.07.23 07:54, Anshuman Khandual wrote:
> 
> 
> On 7/24/23 06:55, mawupeng wrote:
>>
>> On 2023/7/21 18:36, Will Deacon wrote:
>>> On Mon, Jul 17, 2023 at 07:51:50PM +0800, Wupeng Ma wrote:
>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>
>>>> During our test, we found that kernel page table may be unexpectedly
>>>> cleared with rodata off. The root cause is that the kernel page is
>>>> initialized with pud size(1G block mapping) while offline is memory
>>>> block size(MIN_MEMORY_BLOCK_SIZE 128M), eg, if 2G memory is hot-added,
>>>> when offline a memory block, the call trace is shown below,

Is someone adding memory in 2 GiB granularity and then removing parts of 
it in 128 MiB granularity? That would be against what we support using 
the add_memory() / offline_and_remove_memory() API and that driver 
should be fixed instead.

Or does this trigger only when a hotplugged memory block falls into the 
same 2 GiB area as boot memory?

>>>>
>>>>   offline_and_remove_memory
>>>>      try_remove_memory
>>>>        arch_remove_memory
>>>>         __remove_pgd_mapping
>>>>           unmap_hotplug_range
>>>>             unmap_hotplug_p4d_range
>>>>               unmap_hotplug_pud_range
>>>>                 if (pud_sect(pud))
>>>>                   pud_clear(pudp);

Which driver triggers that? In-tree there are only virtio-mem and dax/kmem.
Both add and remove memory in the same granularity.

For example, virtio-mem will only call add_memory(memory_block_size()) 
to then offline_and_remove_memory(memory_block_size()).

Could that trigger it as well?

>>> Sorry, but I'm struggling to understand the problem here. If we're adding
>>> and removing a 2G memory region, why _wouldn't_ we want to use large 1GiB
>>> mappings?
>>
>>> Or are you saying that only a subset of the memory is removed,
>>> but we then accidentally unmap the whole thing?
>> Yes, umap a subset but the whole thing page table entry is removed.
>>

Can we have some more details about the user and how to trigger it?

>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>> index 95d360805f8a..44c724ce4f70 100644
>>>> --- a/arch/arm64/mm/mmu.c
>>>> +++ b/arch/arm64/mm/mmu.c
>>>> @@ -44,6 +44,7 @@
>>>>   #define NO_BLOCK_MAPPINGS	BIT(0)
>>>>   #define NO_CONT_MAPPINGS	BIT(1)
>>>>   #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>>>> +#define NO_PUD_MAPPINGS		BIT(3)
>>>>   
>>>>   int idmap_t0sz __ro_after_init;
>>>>   
>>>> @@ -344,7 +345,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
>>>>   		 */
>>>>   		if (pud_sect_supported() &&
>>>>   		   ((addr | next | phys) & ~PUD_MASK) == 0 &&
>>>> -		    (flags & NO_BLOCK_MAPPINGS) == 0) {
>>>> +		    (flags & (NO_BLOCK_MAPPINGS | NO_PUD_MAPPINGS)) == 0) {
>>>>   			pud_set_huge(pudp, phys, prot);
>>>>   
>>>>   			/*
>>>> @@ -1305,7 +1306,7 @@ struct range arch_get_mappable_range(void)
>>>>   int arch_add_memory(int nid, u64 start, u64 size,
>>>>   		    struct mhp_params *params)
>>>>   {
>>>> -	int ret, flags = NO_EXEC_MAPPINGS;
>>>> +	int ret, flags = NO_EXEC_MAPPINGS | NO_PUD_MAPPINGS;
>>> I think we should allow large mappings here and instead prevent partial
>>> removal of the block, if that's what is causing the issue.
>> This could solve this problem.
>> Or we can prevent  partial removal? Or rebulid page table entry which is not removed?
> 
> + David Hildenbrand <david@redhat.com>
> 
> Splitting the block mapping and rebuilding page table entry to reflect non-removed
> areas will require additional information such as flags and pgtable alloc function
> as in __create_pgd_mapping(), which need to be passed along, depending on whether
> it's tearing down vmemmap (would not have PUD block map) or linear mapping. But I
> am just wondering if we have to go in that direction at all or just prevent partial
> memory block removal as suggested by Will.
> 
> - arch_remove_memory() does not have return type, core MM hotremove would not fail
>    because arch_remove_memory() failed or warned
> 
> - core MM hotremove does check_hotplug_memory_range() which ensures the range and
>    start address are memory_block_size_bytes() aligned
> 
> - Default memory_block_size_bytes() is dependent on SECTION_SIZE_BITS which on arm64
>    now can be less than PUD_SIZE triggering this problem.
> 
> 	#define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
> 
> 	unsigned long __weak memory_block_size_bytes(void)
> 	{
>          	return MIN_MEMORY_BLOCK_SIZE;
> 	}
> 	EXPORT_SYMBOL_GPL(memory_block_size_bytes);
> 
> - We would need to override memory_block_size_bytes() on arm64 to accommodate such
>    scenarios here
> 
> Something like this might work (built but not tested)
> 
> commit 2eb8dc0d08dfe0b2a3bb71df93b12f7bf74a2ca6 (HEAD)
> Author: Anshuman Khandual <anshuman.khandual@arm.com>
> Date:   Mon Jul 24 06:45:34 2023 +0100
> 
>      arm64/mm: Define memory_block_size_bytes()
>      
>      Define memory_block_size_bytes() on arm64 platforms to set minimum hot plug
>      and remove granularity as PUD_SIZE in case where MIN_MEMORY_BLOCK_SIZE just
>      falls below PUD_SIZE. Otherwise a complete PUD block mapping will be teared
>      down while unmapping MIN_MEMORY_BLOCK_SIZE range.
>      
>      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 95d360805f8a..1918459b3460 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1157,6 +1157,17 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>   }
>   
>   #ifdef CONFIG_MEMORY_HOTPLUG
> +unsigned long memory_block_size_bytes(void)
> +{
> +       /*
> +        * Linear mappings might include PUD based block mappings which
> +        * cannot be teared down in part during memory hotremove. Hence
> +        * PUD_SIZE needs to be the minimum granularity, for memory hot
> +        * removal in case MIN_MEMORY_BLOCK_SIZE falls below.
> +        */
> +       return max_t(unsigned long, MIN_MEMORY_BLOCK_SIZE, PUD_SIZE);
> +}
> +
>   void vmemmap_free(unsigned long start, unsigned long end,
>                  struct vmem_altmap *altmap)
>   {
> 

OH god no. That would seriously degrade memory hotplug capabilities in 
virtual environments (especially, virtio-mem and DIMMS).

If someone adds memory in 128 MiB chunks and removes memory in 128 MiB 
chunks, that has to be working.

Removing boot memory is blocked via 
register_memory_notifier(&prevent_bootmem_remove_nb);

-- 
Cheers,

David / dhildenb



* Re: [RFC PATCH] arm64: mm: Fix kernel page tables incorrectly deleted during memory removal
  2023-07-24  6:11       ` David Hildenbrand
@ 2023-07-26  6:20         ` mawupeng
  2023-07-26  7:50           ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: mawupeng @ 2023-07-26  6:20 UTC (permalink / raw)
  To: david, anshuman.khandual, will
  Cc: mawupeng1, catalin.marinas, akpm, sudaraja, linux-mm,
	linux-kernel, wangkefeng.wang, linux-arm-kernel, mark.rutland



On 2023/7/24 14:11, David Hildenbrand wrote:
> On 24.07.23 07:54, Anshuman Khandual wrote:
>>
>>
>> On 7/24/23 06:55, mawupeng wrote:
>>>
>>> On 2023/7/21 18:36, Will Deacon wrote:
>>>> On Mon, Jul 17, 2023 at 07:51:50PM +0800, Wupeng Ma wrote:
>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>>
>>>>> During our test, we found that kernel page table may be unexpectedly
>>>>> cleared with rodata off. The root cause is that the kernel page is
>>>>> initialized with pud size(1G block mapping) while offline is memory
>>>>> block size(MIN_MEMORY_BLOCK_SIZE 128M), eg, if 2G memory is hot-added,
>>>>> when offline a memory block, the call trace is shown below,
> 
> Is someone adding memory in 2 GiB granularity and then removing parts of it in 128 MiB granularity? That would be against what we support using the add_memory() / offline_and_remove_memory() API and that driver should be fixed instead.

Yes, exactly this kind of situation.

The problem occurs in the following scenario:
1. Use mem=xxG to reserve memory.
2. Use add_memory() to online the reserved memory.
3. Offline part of that memory via offline_and_remove_memory().

During my research I found that ACPI memory removal uses memory_subsys_offline() to offline
memory sections, which does not delete page table entries and therefore does not trigger this problem.

So, if I understand you correctly:
1. A 3rd-party driver shouldn't use add_memory()/offline_and_remove_memory() to online/offline
   memory at mismatched granularities. If it has to, this can be handled in the driver.
2. memory_subsys_offline() is preferred for such things.

Should we update the documentation to describe this limitation?

> 
> Or does this trigger only when a hotplugged memory block falls into the same 2 GiB area as boot memory?
>>>>>
>>>>>   offline_and_remove_memory
>>>>>      try_remove_memory
>>>>>        arch_remove_memory
>>>>>         __remove_pgd_mapping
>>>>>           unmap_hotplug_range
>>>>>             unmap_hotplug_p4d_range
>>>>>               unmap_hotplug_pud_range
>>>>>                 if (pud_sect(pud))
>>>>>                   pud_clear(pudp);
> 
> Which drivers triggers that? In-tree is only virtio-mem and dax/kmem. Both add and remove memory in the same granularity.

It is a 3rd-party driver, which tries to offline part (128M) of the movable memory, and this leads to the problem.

> 
> For example, virtio-mem will only call add_memory(memory_block_size()) to then offline_and_remove_memory(memory_block_size()).
> 
> Could that trigger it as well?
> 
>>>> Sorry, but I'm struggling to understand the problem here. If we're adding
>>>> and removing a 2G memory region, why _wouldn't_ we want to use large 1GiB
>>>> mappings?
>>>
>>>> Or are you saying that only a subset of the memory is removed,
>>>> but we then accidentally unmap the whole thing?
>>> Yes, umap a subset but the whole thing page table entry is removed.
>>>
> 
> Can we have some more details about the user and how to trigger it?
> 
>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>> index 95d360805f8a..44c724ce4f70 100644
>>>>> --- a/arch/arm64/mm/mmu.c
>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>> @@ -44,6 +44,7 @@
>>>>>   #define NO_BLOCK_MAPPINGS    BIT(0)
>>>>>   #define NO_CONT_MAPPINGS    BIT(1)
>>>>>   #define NO_EXEC_MAPPINGS    BIT(2)    /* assumes FEAT_HPDS is not used */
>>>>> +#define NO_PUD_MAPPINGS        BIT(3)
>>>>>     int idmap_t0sz __ro_after_init;
>>>>>   @@ -344,7 +345,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
>>>>>            */
>>>>>           if (pud_sect_supported() &&
>>>>>              ((addr | next | phys) & ~PUD_MASK) == 0 &&
>>>>> -            (flags & NO_BLOCK_MAPPINGS) == 0) {
>>>>> +            (flags & (NO_BLOCK_MAPPINGS | NO_PUD_MAPPINGS)) == 0) {
>>>>>               pud_set_huge(pudp, phys, prot);
>>>>>                 /*
>>>>> @@ -1305,7 +1306,7 @@ struct range arch_get_mappable_range(void)
>>>>>   int arch_add_memory(int nid, u64 start, u64 size,
>>>>>               struct mhp_params *params)
>>>>>   {
>>>>> -    int ret, flags = NO_EXEC_MAPPINGS;
>>>>> +    int ret, flags = NO_EXEC_MAPPINGS | NO_PUD_MAPPINGS;
>>>> I think we should allow large mappings here and instead prevent partial
>>>> removal of the block, if that's what is causing the issue.
>>> This could solve this problem.
>>> Or we can prevent  partial removal? Or rebulid page table entry which is not removed?
>>
>> + David Hildenbrand <david@redhat.com>
>>
>> Splitting the block mapping and rebuilding page table entry to reflect non-removed
>> areas will require additional information such as flags and pgtable alloc function
>> as in __create_pgd_mapping(), which need to be passed along, depending on whether
>> it's tearing down vmemmap (would not have PUD block map) or linear mapping. But I
>> am just wondering if we have to go in that direction at all or just prevent partial
>> memory block removal as suggested by Will.
>>
>> - arch_remove_memory() does not have return type, core MM hotremove would not fail
>>    because arch_remove_memory() failed or warned
>>
>> - core MM hotremove does check_hotplug_memory_range() which ensures the range and
>>    start address are memory_block_size_bytes() aligned
>>
>> - Default memory_block_size_bytes() is dependent on SECTION_SIZE_BITS which on arm64
>>    now can be less than PUD_SIZE triggering this problem.
>>
>>     #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
>>
>>     unsigned long __weak memory_block_size_bytes(void)
>>     {
>>              return MIN_MEMORY_BLOCK_SIZE;
>>     }
>>     EXPORT_SYMBOL_GPL(memory_block_size_bytes);
>>
>> - We would need to override memory_block_size_bytes() on arm64 to accommodate such
>>    scenarios here
>>
>> Something like this might work (built but not tested)
>>
>> commit 2eb8dc0d08dfe0b2a3bb71df93b12f7bf74a2ca6 (HEAD)
>> Author: Anshuman Khandual <anshuman.khandual@arm.com>
>> Date:   Mon Jul 24 06:45:34 2023 +0100
>>
>>      arm64/mm: Define memory_block_size_bytes()
>>
>>      Define memory_block_size_bytes() on arm64 platforms to set the minimum
>>      hot plug and remove granularity to PUD_SIZE in case MIN_MEMORY_BLOCK_SIZE
>>      falls below PUD_SIZE. Otherwise a complete PUD block mapping will be torn
>>      down while unmapping a MIN_MEMORY_BLOCK_SIZE range.
>>
>>      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 95d360805f8a..1918459b3460 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1157,6 +1157,17 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>   }
>>     #ifdef CONFIG_MEMORY_HOTPLUG
>> +unsigned long memory_block_size_bytes(void)
>> +{
>> +       /*
>> +        * Linear mappings might include PUD based block mappings which
>> +        * cannot be teared down in part during memory hotremove. Hence
>> +        * PUD_SIZE needs to be the minimum granularity, for memory hot
>> +        * removal in case MIN_MEMORY_BLOCK_SIZE falls below.
>> +        */
>> +       return max_t(unsigned long, MIN_MEMORY_BLOCK_SIZE, PUD_SIZE);
>> +}
>> +
>>   void vmemmap_free(unsigned long start, unsigned long end,
>>                  struct vmem_altmap *altmap)
>>   {
>>
> 
> OH god no. That would seriously degrade memory hotplug capabilities in virtual environments (especially, virtio-mem and DIMMS).
> 
> If someone adds memory in 128 MiB chunks and removes memory in 128 MiB chunks, that has to be working.
> 
> Removing boot memory is blocked via register_memory_notifier(&prevent_bootmem_remove_nb);
> 


* Re: [RFC PATCH] arm64: mm: Fix kernel page tables incorrectly deleted during memory removal
  2023-07-26  6:20         ` mawupeng
@ 2023-07-26  7:50           ` David Hildenbrand
  2023-07-28  1:06               ` mawupeng
  2023-07-28  4:01               ` Anshuman Khandual
  0 siblings, 2 replies; 13+ messages in thread
From: David Hildenbrand @ 2023-07-26  7:50 UTC (permalink / raw)
  To: mawupeng, anshuman.khandual, will
  Cc: catalin.marinas, akpm, sudaraja, linux-mm, linux-kernel,
	wangkefeng.wang, linux-arm-kernel, mark.rutland

On 26.07.23 08:20, mawupeng wrote:
> 
> 
> On 2023/7/24 14:11, David Hildenbrand wrote:
>> On 24.07.23 07:54, Anshuman Khandual wrote:
>>>
>>>
>>> On 7/24/23 06:55, mawupeng wrote:
>>>>
>>>> On 2023/7/21 18:36, Will Deacon wrote:
>>>>> On Mon, Jul 17, 2023 at 07:51:50PM +0800, Wupeng Ma wrote:
>>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>>>
>>>>>> During our test, we found that kernel page table may be unexpectedly
>>>>>> cleared with rodata off. The root cause is that the kernel page is
>>>>>> initialized with pud size(1G block mapping) while offline is memory
>>>>>> block size(MIN_MEMORY_BLOCK_SIZE 128M), eg, if 2G memory is hot-added,
>>>>>> when offline a memory block, the call trace is shown below,
>>
>> Is someone adding memory in 2 GiB granularity and then removing parts of it in 128 MiB granularity? That would be against what we support using the add_memory() / offline_and_remove_memory() API and that driver should be fixed instead.
> 
> Yes, this kind of situation.
> 
> The problem occurs in the following scenario:
> 1. use mem=xxG to reserve memory.
> 2. add_memory() to online the memory.
> 3. offline part of that memory via offline_and_remove_memory().
> 
> During my research I found that ACPI memory removal uses memory_subsys_offline() to offline a
> memory section, and this does not delete page table entries, so it does not trigger this kind of problem.
> 
> So, to check my understanding of what you are saying:
> 1. A 3rd-party driver shouldn't use add_memory()/offline_and_remove_memory() to online/offline memory.
>     If it has to use them, this can be handled by the driver.
> 2. memory_subsys_offline() is preferred for such things.

No, my point is that

1) If you use add_memory() and offline_and_remove_memory() at the *same
    granularity*, it has to work; otherwise it has to be fixed.

2) If you use add_memory() and offline_and_remove_memory() at different
    granularities (especially add_memory() at a bigger granularity), then
    change your code to do add_memory() at the same granularity.


If you run into 1), then we populated a PUD for boot memory that also 
covers yet unpopulated physical memory ranges that are later populated 
by add_memory(). If that's the case, then we can either fix it by

a) Not doing that. Use PMD tables instead for that piece of memory.

b) Detecting that that PUD still covers memory and refusing to remove
    that PUD.

c) Rejecting to hotadd memory in this situation at that location. We
    have mhp_get_pluggable_range() -> arch_get_mappable_range() to kind-
    of handle something like that.

-- 
Cheers,

David / dhildenb



* Re: [RFC PATCH] arm64: mm: Fix kernel page tables incorrectly deleted during memory removal
  2023-07-26  7:50           ` David Hildenbrand
@ 2023-07-28  1:06               ` mawupeng
  2023-07-28  4:01               ` Anshuman Khandual
  1 sibling, 0 replies; 13+ messages in thread
From: mawupeng @ 2023-07-28  1:06 UTC (permalink / raw)
  To: david, anshuman.khandual, will
  Cc: mawupeng1, catalin.marinas, akpm, sudaraja, linux-mm,
	linux-kernel, wangkefeng.wang, linux-arm-kernel, mark.rutland



On 2023/7/26 15:50, David Hildenbrand wrote:
> On 26.07.23 08:20, mawupeng wrote:
>>
>>
>> On 2023/7/24 14:11, David Hildenbrand wrote:
>>> On 24.07.23 07:54, Anshuman Khandual wrote:
>>>>
>>>>
>>>> On 7/24/23 06:55, mawupeng wrote:
>>>>>
>>>>> On 2023/7/21 18:36, Will Deacon wrote:
>>>>>> On Mon, Jul 17, 2023 at 07:51:50PM +0800, Wupeng Ma wrote:
>>>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>>>>
>>>>>>> During our test, we found that kernel page table may be unexpectedly
>>>>>>> cleared with rodata off. The root cause is that the kernel page is
>>>>>>> initialized with pud size(1G block mapping) while offline is memory
>>>>>>> block size(MIN_MEMORY_BLOCK_SIZE 128M), eg, if 2G memory is hot-added,
>>>>>>> when offline a memory block, the call trace is shown below,
>>>
>>> Is someone adding memory in 2 GiB granularity and then removing parts of it in 128 MiB granularity? That would be against what we support using the add_memory() / offline_and_remove_memory() API and that driver should be fixed instead.
>>
>> Yes, this kind of situation.
>>
>> The problem occurs in the following scenario:
>> 1. use mem=xxG to reserve memory.
>> 2. add_memory() to online the memory.
>> 3. offline part of that memory via offline_and_remove_memory().
>>
>> During my research I found that ACPI memory removal uses memory_subsys_offline() to offline a
>> memory section, and this does not delete page table entries, so it does not trigger this kind of problem.
>>
>> So, to check my understanding of what you are saying:
>> 1. A 3rd-party driver shouldn't use add_memory()/offline_and_remove_memory() to online/offline memory.
>>     If it has to use them, this can be handled by the driver.
>> 2. memory_subsys_offline() is preferred for such things.
> 
> No, my point is that
> 
> 1) If you use add_memory() and offline_and_remove_memory() at the *same
>    granularity*, it has to work; otherwise it has to be fixed.
> 
> 2) If you use add_memory() and offline_and_remove_memory() at different
>    granularities (especially add_memory() at a bigger granularity), then
>    change your code to do add_memory() at the same granularity.
> 
> 
> If you run into 1), then we populated a PUD for boot memory that also covers yet unpopulated physical memory ranges that are later populated by add_memory(). If that's the case, then we can either fix it by
> 
> a) Not doing that. Use PMD tables instead for that piece of memory.
> 
> b) Detecting that that PUD still covers memory and refusing to remove
>    that PUD.
> 
> c) Rejecting to hotadd memory in this situation at that location. We
>    have mhp_get_pluggable_range() -> arch_get_mappable_range() to kind-
>    of handle something like that.

Thank you for your patient answer.

This I do understand, and it answers my question.

> 


* Re: [RFC PATCH] arm64: mm: Fix kernel page tables incorrectly deleted during memory removal
  2023-07-26  7:50           ` David Hildenbrand
@ 2023-07-28  4:01               ` Anshuman Khandual
  2023-07-28  4:01               ` Anshuman Khandual
  1 sibling, 0 replies; 13+ messages in thread
From: Anshuman Khandual @ 2023-07-28  4:01 UTC (permalink / raw)
  To: David Hildenbrand, mawupeng, will
  Cc: catalin.marinas, akpm, sudaraja, linux-mm, linux-kernel,
	wangkefeng.wang, linux-arm-kernel, mark.rutland



On 7/26/23 13:20, David Hildenbrand wrote:
> On 26.07.23 08:20, mawupeng wrote:
>>
>>
>> On 2023/7/24 14:11, David Hildenbrand wrote:
>>> On 24.07.23 07:54, Anshuman Khandual wrote:
>>>>
>>>>
>>>> On 7/24/23 06:55, mawupeng wrote:
>>>>>
>>>>> On 2023/7/21 18:36, Will Deacon wrote:
>>>>>> On Mon, Jul 17, 2023 at 07:51:50PM +0800, Wupeng Ma wrote:
>>>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>>>>
>>>>>>> During our test, we found that kernel page table may be unexpectedly
>>>>>>> cleared with rodata off. The root cause is that the kernel page is
>>>>>>> initialized with pud size(1G block mapping) while offline is memory
>>>>>>> block size(MIN_MEMORY_BLOCK_SIZE 128M), eg, if 2G memory is hot-added,
>>>>>>> when offline a memory block, the call trace is shown below,
>>>
>>> Is someone adding memory in 2 GiB granularity and then removing parts of it in 128 MiB granularity? That would be against what we support using the add_memory() / offline_and_remove_memory() API and that driver should be fixed instead.
>>
>> Yes, this kind of situation.
>>
>> The problem occurs in the following scenario:
>> 1. use mem=xxG to reserve memory.
>> 2. add_memory() to online the memory.
>> 3. offline part of that memory via offline_and_remove_memory().
>>
>> During my research I found that ACPI memory removal uses memory_subsys_offline() to offline a
>> memory section, and this does not delete page table entries, so it does not trigger this kind of problem.
>>
>> So, to check my understanding of what you are saying:
>> 1. A 3rd-party driver shouldn't use add_memory()/offline_and_remove_memory() to online/offline memory.
>>     If it has to use them, this can be handled by the driver.
>> 2. memory_subsys_offline() is preferred for such things.
> 
> No, my point is that
> 
> 1) If you use add_memory() and offline_and_remove_memory() at the *same
>    granularity*, it has to work; otherwise it has to be fixed.
> 
> 2) If you use add_memory() and offline_and_remove_memory() at different
>    granularities (especially add_memory() at a bigger granularity), then
>    change your code to do add_memory() at the same granularity.
> 
> 
> If you run into 1), then we populated a PUD for boot memory that also covers yet unpopulated physical memory ranges that are later populated by add_memory(). If that's the case, then we can either fix it by

Is that case possible?

__create_pgd_mapping() is called to create the mapping in both the hotplug
and boot memory cases. alloc_init_pud() ensures, before creating a huge
(block) page entry, that both the virtual and physical address ranges are
PUD_SIZE aligned:

((addr | next | phys) & ~PUD_MASK) == 0

> 
> a) Not doing that. Use PMD tables instead for that piece of memory.
> 
> b) Detecting that that PUD still covers memory and refusing to remove
>    that PUD.
> 
> c) Rejecting to hotadd memory in this situation at that location. We
>    have mhp_get_pluggable_range() -> arch_get_mappable_range() to kind-
>    of handle something like that.
>



end of thread, other threads:[~2023-07-28  4:01 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-17 11:51 [RFC PATCH] arm64: mm: Fix kernel page tables incorrectly deleted during memory removal Wupeng Ma
2023-07-17 11:51 ` Wupeng Ma
2023-07-21 10:36 ` Will Deacon
2023-07-21 10:36   ` Will Deacon
2023-07-24  1:25   ` mawupeng
2023-07-24  5:54     ` Anshuman Khandual
2023-07-24  6:11       ` David Hildenbrand
2023-07-26  6:20         ` mawupeng
2023-07-26  7:50           ` David Hildenbrand
2023-07-28  1:06             ` mawupeng
2023-07-28  1:06               ` mawupeng
2023-07-28  4:01             ` Anshuman Khandual
2023-07-28  4:01               ` Anshuman Khandual

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.