linux-kernel.vger.kernel.org archive mirror
* [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
@ 2022-10-25  1:42 Wupeng Ma
  2022-10-25  6:36 ` Muchun Song
  0 siblings, 1 reply; 14+ messages in thread
From: Wupeng Ma @ 2022-10-25  1:42 UTC (permalink / raw)
  To: akpm, mike.kravetz, songmuchun
  Cc: mhocko, osalvador, catalin.marinas, linux-mm, linux-kernel, mawupeng1

From: Ma Wupeng <mawupeng1@huawei.com>

Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
each HugeTLB page") added vmemmap_remap_pte() to remap the tail pages as
read-only in order to catch illegal write operations to the tail pages.

However, this triggers a WARN_ON on arm64 in __check_racy_pte_update(),
since the remapping may clear the pte's dirty state. This check was
introduced by commit 2f4b829c625e ("arm64: Add support for hardware updates
of the access and dirty pte bits"), where it initially read:

  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
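
In 6.1-rc1 this has been relaxed to the VM_WARN_ONCE() in
__check_racy_pte_update() that is seen firing in the log below:

  VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
               "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
               __func__, pte_val(old_pte), pte_val(pte));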

Since we do need to mark this pte as read-only to catch illegal write
operations to the tail pages, use set_pte() instead of set_pte_at() to
bypass this check.
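
For context, set_pte_at() on arm64 is more than a plain store. A simplified
sketch of the arch/arm64/include/asm/pgtable.h path (not verbatim) is:

  static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
                                  pte_t *ptep, pte_t pte)
  {
          /* icache/MTE maintenance for user executable/tagged pages ... */
          __check_racy_pte_update(mm, ptep, pte); /* fires the WARN_ON below */
          set_pte(ptep, pte);                     /* the actual PTE store */
  }

Calling set_pte() directly therefore skips only the debug check; the skipped
user-page maintenance does not apply to a kernel mapping like the vmemmap.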

The following shell commands can be used to reproduce this WARN_ON on
6.1.0-rc1:

  echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
  cat /proc/sys/vm/hugetlb_optimize_vmemmap

  echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
  mkdir -p /root/hugetlb
  mount none /root/hugetlb -t hugetlbfs
  fallocate -l 2g /root/hugetlb/xx &

Here is the detailed WARN_ON log:

------------[ cut here ]------------
__check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
Modules linked in:
CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
 vmemmap_remap_pte+0x118/0x120
 vmemmap_remap_range+0x30c/0x560
 hugetlb_vmemmap_optimize+0x158/0x408
 __prep_new_huge_page+0x24/0x150
 prep_new_huge_page+0x30/0x60
 alloc_fresh_huge_page+0x1c4/0x1e0
 alloc_surplus_huge_page+0xa0/0x168
 alloc_huge_page+0x264/0x5b8
 hugetlbfs_fallocate+0x294/0x680
 vfs_fallocate+0x12c/0x568
 ksys_fallocate+0x50/0xa0
 __arm64_sys_fallocate+0x28/0x38
 invoke_syscall+0x4c/0x110
 el0_svc_common.constprop.0+0x68/0x128
 do_el0_svc+0x34/0xd0
 el0_svc+0x48/0xb8
 el0t_64_sync_handler+0xb8/0xc0
 el0t_64_sync+0x18c/0x190
---[ end trace 0000000000000000 ]---

Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
---
 mm/hugetlb_vmemmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index ba2a2596fb4e..cb056265c31e 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 	struct page *page = pte_page(*pte);
 
 	list_add_tail(&page->lru, walk->vmemmap_pages);
-	set_pte_at(&init_mm, addr, pte, entry);
+	set_pte(pte, entry);
 }
 
 /*
-- 
2.25.1



* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-25  1:42 [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte Wupeng Ma
@ 2022-10-25  6:36 ` Muchun Song
  2022-10-26  3:01   ` mawupeng
  2022-10-26  5:06   ` Anshuman Khandual
  0 siblings, 2 replies; 14+ messages in thread
From: Muchun Song @ 2022-10-25  6:36 UTC (permalink / raw)
  To: Wupeng Ma
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Michal Hocko,
	Oscar Salvador, catalin.marinas, Linux Memory Management List,
	linux-kernel



> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
> 
> From: Ma Wupeng <mawupeng1@huawei.com>
> 
> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
> read-only to catch illegal write operation to the tail page.
> 
> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()

Thanks for finding this issue.

> since this may lead to dirty state cleaned. This check is introduced by
> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
> access and dirty pte bits") and the initial check is as follow:
> 
>  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
> 
> Since we do need to mark this pte as read-only to catch illegal write
> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
> this check.

In theory, the warning does not affect anything, since the tail vmemmap
pages are supposed to be read-only. So, skipping this check for vmemmap
pages seems feasible. But I am not sure it is appropriate to use set_pte()
here instead of set_pte_at(); I didn't see any users of set_pte() in common
code outside of arch/xxx. And this issue is specific to arm64, so I suggest
fixing it in __check_racy_pte_update() itself.

Something like this (just some thoughts of mine):

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b5df82aa99e6..df7716965a93 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
  *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
  */

-static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
+static inline void __check_racy_pte_update(struct mm_struct *mm,
+                                          unsigned long addr, pte_t *ptep,
                                           pte_t pte)
 {
        pte_t old_pte;
@@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
        if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
                return;

+       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
+           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
+               return;
+
        /*
         * Check for potential race with hardware updates of the pte
         * (ptep_set_access_flags safely changes valid ptes without going

> 
> The following shell command can be used to reproduce this WARN_ON in
> 6.1.0-rc1:
> 
>  echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>  cat /proc/sys/vm/hugetlb_optimize_vmemmap
> 
>  echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>  mkdir -p /root/hugetlb
>  mount none /root/hugetlb -t hugetlbfs
>  fallocate -l 2g /root/hugetlb/xx &
> 
> Here is the detail WARN_ON log:
> 
> ------------[ cut here ]------------
> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
> Modules linked in:
> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> Call trace:
> vmemmap_remap_pte+0x118/0x120
> vmemmap_remap_range+0x30c/0x560
> hugetlb_vmemmap_optimize+0x158/0x408
> __prep_new_huge_page+0x24/0x150
> prep_new_huge_page+0x30/0x60
> alloc_fresh_huge_page+0x1c4/0x1e0
> alloc_surplus_huge_page+0xa0/0x168
> alloc_huge_page+0x264/0x5b8
> hugetlbfs_fallocate+0x294/0x680
> vfs_fallocate+0x12c/0x568
> ksys_fallocate+0x50/0xa0
> __arm64_sys_fallocate+0x28/0x38
> invoke_syscall+0x4c/0x110
> el0_svc_common.constprop.0+0x68/0x128
> do_el0_svc+0x34/0xd0
> el0_svc+0x48/0xb8
> el0t_64_sync_handler+0xb8/0xc0
> el0t_64_sync+0x18c/0x190
> ---[ end trace 0000000000000000 ]---
> 
> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")

Actually, this commit does not introduce the issue on arm64. I think the
correct commit to reference is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.

Thanks.

> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
> ---
> mm/hugetlb_vmemmap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index ba2a2596fb4e..cb056265c31e 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> struct page *page = pte_page(*pte);
> 
> list_add_tail(&page->lru, walk->vmemmap_pages);
> - set_pte_at(&init_mm, addr, pte, entry);
> + set_pte(pte, entry);
> }
> 
> /*
> -- 
> 2.25.1
> 
> 



* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-25  6:36 ` Muchun Song
@ 2022-10-26  3:01   ` mawupeng
  2022-10-29  1:55     ` mawupeng
  2022-10-26  5:06   ` Anshuman Khandual
  1 sibling, 1 reply; 14+ messages in thread
From: mawupeng @ 2022-10-26  3:01 UTC (permalink / raw)
  To: muchun.song
  Cc: mawupeng1, akpm, mike.kravetz, songmuchun, mhocko, osalvador,
	catalin.marinas, linux-mm, linux-kernel



On 2022/10/25 14:36, Muchun Song wrote:
> 
> 
>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>
>> From: Ma Wupeng <mawupeng1@huawei.com>
>>
>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>> read-only to catch illegal write operation to the tail page.
>>
>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
> 
> Thanks for your finding this issue.
> 
>> since this may lead to dirty state cleaned. This check is introduced by
>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>> access and dirty pte bits") and the initial check is as follow:
>>
>>  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>
>> Since we do need to mark this pte as read-only to catch illegal write
>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>> this check.
> 
> In theory, the waring does not affect anything since the tail vmemmap
> pages are supposed to be read-only. So, skipping this check for vmemmap
> pages seem feasible. But I am not sure whether it is general to use
> set_pte here instead of set_pte_at, I didn’t see any users of set_pte
> from the common code routines except the code from arch/xxx. And this
> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update()
> itself.
> 
> Something like (Just some thoughts from mine):
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b5df82aa99e6..df7716965a93 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>   *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>   */
> 
> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
> +static inline void __check_racy_pte_update(struct mm_struct *mm,
> +                                          unsigned long addr, pte_t *ptep,
>                                            pte_t pte)
>  {
>         pte_t old_pte;
> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>         if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>                 return;
> 
> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
> +               return;
> +
>         /*
>          * Check for potential race with hardware updates of the pte
>          * (ptep_set_access_flags safely changes valid ptes without going
> 
>>
>> The following shell command can be used to reproduce this WARN_ON in
>> 6.1.0-rc1:
>>
>>  echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>>  cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>
>>  echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>>  mkdir -p /root/hugetlb
>>  mount none /root/hugetlb -t hugetlbfs
>>  fallocate -l 2g /root/hugetlb/xx &
>>
>> Here is the detail WARN_ON log:
>>
>> ------------[ cut here ]------------
>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>> Modules linked in:
>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>> Call trace:
>> vmemmap_remap_pte+0x118/0x120
>> vmemmap_remap_range+0x30c/0x560
>> hugetlb_vmemmap_optimize+0x158/0x408
>> __prep_new_huge_page+0x24/0x150
>> prep_new_huge_page+0x30/0x60
>> alloc_fresh_huge_page+0x1c4/0x1e0
>> alloc_surplus_huge_page+0xa0/0x168
>> alloc_huge_page+0x264/0x5b8
>> hugetlbfs_fallocate+0x294/0x680
>> vfs_fallocate+0x12c/0x568
>> ksys_fallocate+0x50/0xa0
>> __arm64_sys_fallocate+0x28/0x38
>> invoke_syscall+0x4c/0x110
>> el0_svc_common.constprop.0+0x68/0x128
>> do_el0_svc+0x34/0xd0
>> el0_svc+0x48/0xb8
>> el0t_64_sync_handler+0xb8/0xc0
>> el0t_64_sync+0x18c/0x190
>> ---[ end trace 0000000000000000 ]---
>>
>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
> 
> Actually, this commit does not pose the issue for arm64. I think the correct commit
> which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
> 
> Thanks.

Thanks for pointing it out.

I have tested it, and the problem is indeed introduced by commit
1e63ac088f20f7a4425c430c31ecd3cf167fb3f2. This VM_WARN_ONCE is produced once
HUGETLB_PAGE_FREE_VMEMMAP is enabled for arm64.

I will update my changelog and fix this in the next version of the patch.

> 
>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>> ---
>> mm/hugetlb_vmemmap.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index ba2a2596fb4e..cb056265c31e 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>> struct page *page = pte_page(*pte);
>>
>> list_add_tail(&page->lru, walk->vmemmap_pages);
>> - set_pte_at(&init_mm, addr, pte, entry);
>> + set_pte(pte, entry);
>> }
>>
>> /*
>> -- 
>> 2.25.1
>>
>>
> 


* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-25  6:36 ` Muchun Song
  2022-10-26  3:01   ` mawupeng
@ 2022-10-26  5:06   ` Anshuman Khandual
  2022-10-26  7:01     ` Muchun Song
  2022-10-27  1:42     ` mawupeng
  1 sibling, 2 replies; 14+ messages in thread
From: Anshuman Khandual @ 2022-10-26  5:06 UTC (permalink / raw)
  To: Muchun Song, Wupeng Ma
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Michal Hocko,
	Oscar Salvador, catalin.marinas, Linux Memory Management List,
	linux-kernel



On 10/25/22 12:06, Muchun Song wrote:
> 
> 
>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>
>> From: Ma Wupeng <mawupeng1@huawei.com>
>>
>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>> read-only to catch illegal write operation to the tail page.
>>
>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
> 
> Thanks for your finding this issue.
> 
>> since this may lead to dirty state cleaned. This check is introduced by
>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>> access and dirty pte bits") and the initial check is as follow:
>>
>>  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>
>> Since we do need to mark this pte as read-only to catch illegal write
>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>> this check.
> 
> In theory, the waring does not affect anything since the tail vmemmap
> pages are supposed to be read-only. So, skipping this check for vmemmap

Tail vmemmap pages are supposed to be read-only in practice, but their
current mappings do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
would not have triggered.

        VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
                     "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
                     __func__, pte_val(old_pte), pte_val(pte));

Also, is it not true that the pte is being remapped, read-only, to a
different page than the one it originally mapped (which will be freed up),
i.e. the PFN in 'old_pte' and 'pte' will differ? Hence, is there still a
possibility of a race condition even when the PFN changes?

> pages seem feasible. But I am not sure whether it is general to use
> set_pte here instead of set_pte_at, I didn’t see any users of set_pte
> from the common code routines except the code from arch/xxx. And this
> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update()
> itself.

Right, we should not switch to the even lower-level platform helper
set_pte() just to work around this warning. Instead, __check_racy_pte_update()
is the right place to fix this.

> 
> Something like (Just some thoughts from mine):
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b5df82aa99e6..df7716965a93 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>   *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>   */
> 
> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
> +static inline void __check_racy_pte_update(struct mm_struct *mm,
> +                                          unsigned long addr, pte_t *ptep,
>                                            pte_t pte)
>  {
>         pte_t old_pte;
> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>         if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>                 return;
> 
> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
> +               return;
> +
>         /*
>          * Check for potential race with hardware updates of the pte
>          * (ptep_set_access_flags safely changes valid ptes without going
> 
>>
>> The following shell command can be used to reproduce this WARN_ON in
>> 6.1.0-rc1:
>>
>>  echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>>  cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>
>>  echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>>  mkdir -p /root/hugetlb
>>  mount none /root/hugetlb -t hugetlbfs
>>  fallocate -l 2g /root/hugetlb/xx &
>>
>> Here is the detail WARN_ON log:
>>
>> ------------[ cut here ]------------
>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>> Modules linked in:
>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>> Call trace:
>> vmemmap_remap_pte+0x118/0x120
>> vmemmap_remap_range+0x30c/0x560
>> hugetlb_vmemmap_optimize+0x158/0x408
>> __prep_new_huge_page+0x24/0x150
>> prep_new_huge_page+0x30/0x60
>> alloc_fresh_huge_page+0x1c4/0x1e0
>> alloc_surplus_huge_page+0xa0/0x168
>> alloc_huge_page+0x264/0x5b8
>> hugetlbfs_fallocate+0x294/0x680
>> vfs_fallocate+0x12c/0x568
>> ksys_fallocate+0x50/0xa0
>> __arm64_sys_fallocate+0x28/0x38
>> invoke_syscall+0x4c/0x110
>> el0_svc_common.constprop.0+0x68/0x128
>> do_el0_svc+0x34/0xd0
>> el0_svc+0x48/0xb8
>> el0t_64_sync_handler+0xb8/0xc0
>> el0t_64_sync+0x18c/0x190
>> ---[ end trace 0000000000000000 ]---
>>
>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
> 
> Actually, this commit does not pose the issue for arm64. I think the correct commit
> which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
> 
> Thanks.
> 
>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>> ---
>> mm/hugetlb_vmemmap.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index ba2a2596fb4e..cb056265c31e 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>> struct page *page = pte_page(*pte);
>>
>> list_add_tail(&page->lru, walk->vmemmap_pages);
>> - set_pte_at(&init_mm, addr, pte, entry);
>> + set_pte(pte, entry);
>> }
>>
>> /*
>> -- 
>> 2.25.1
>>
>>
> 
> 


* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-26  5:06   ` Anshuman Khandual
@ 2022-10-26  7:01     ` Muchun Song
  2022-10-26  8:36       ` Anshuman Khandual
  2022-10-27  1:42     ` mawupeng
  1 sibling, 1 reply; 14+ messages in thread
From: Muchun Song @ 2022-10-26  7:01 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Wupeng Ma, Andrew Morton, Mike Kravetz, Muchun Song,
	Michal Hocko, Oscar Salvador, catalin.marinas,
	Linux Memory Management List, linux-kernel



> On Oct 26, 2022, at 13:06, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
> 
> 
> 
> On 10/25/22 12:06, Muchun Song wrote:
>> 
>> 
>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>> 
>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>> 
>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>> read-only to catch illegal write operation to the tail page.
>>> 
>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>> 
>> Thanks for your finding this issue.
>> 
>>> since this may lead to dirty state cleaned. This check is introduced by
>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>> access and dirty pte bits") and the initial check is as follow:
>>> 
>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>> 
>>> Since we do need to mark this pte as read-only to catch illegal write
>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>> this check.
>> 
>> In theory, the waring does not affect anything since the tail vmemmap
>> pages are supposed to be read-only. So, skipping this check for vmemmap
> 
> Tails vmemmap pages are supposed to be read-only, in practice but their
> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
> warning would not have triggered.

Right.

> 
>        VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>                     "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>                     __func__, pte_val(old_pte), pte_val(pte));
> 
> Also, is not it true that the pte being remapped into a different page
> as read only, than what it had originally (which will be freed up) i.e 
> the PFN in 'old_pte' and 'pte' will be different. Hence is there still

Right.

> a possibility for a race condition even when the PFN changes ?

Sorry, I didn't get this question. Do you mean that the PTE could be changed
from the new value (pte) back to the old one (old_pte) by the hardware
updating the dirty bit on a concurrent write to the tail vmemmap page?

Muchun,
Thanks.

> 
>> pages seem feasible. But I am not sure whether it is general to use
>> set_pte here instead of set_pte_at, I didn’t see any users of set_pte
>> from the common code routines except the code from arch/xxx. And this
>> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update()
>> itself.
> 
> Right, should not change it to yet lower level platform helper set_pte()
> just to work around this warning. Instead, __check_racy_pte_update() is
> the right place where it should be fixed.
> 
>> 
>> Something like (Just some thoughts from mine):
>> 
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index b5df82aa99e6..df7716965a93 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>>  *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>>  */
>> 
>> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>> +static inline void __check_racy_pte_update(struct mm_struct *mm,
>> +                                          unsigned long addr, pte_t *ptep,
>>                                           pte_t pte)
>> {
>>        pte_t old_pte;
>> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>>        if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>>                return;
>> 
>> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
>> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
>> +               return;
>> +
>>        /*
>>         * Check for potential race with hardware updates of the pte
>>         * (ptep_set_access_flags safely changes valid ptes without going
>> 
>>> 
>>> The following shell command can be used to reproduce this WARN_ON in
>>> 6.1.0-rc1:
>>> 
>>> echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>>> cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>> 
>>> echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>>> mkdir -p /root/hugetlb
>>> mount none /root/hugetlb -t hugetlbfs
>>> fallocate -l 2g /root/hugetlb/xx &
>>> 
>>> Here is the detail WARN_ON log:
>>> 
>>> ------------[ cut here ]------------
>>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>>> Modules linked in:
>>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>> Call trace:
>>> vmemmap_remap_pte+0x118/0x120
>>> vmemmap_remap_range+0x30c/0x560
>>> hugetlb_vmemmap_optimize+0x158/0x408
>>> __prep_new_huge_page+0x24/0x150
>>> prep_new_huge_page+0x30/0x60
>>> alloc_fresh_huge_page+0x1c4/0x1e0
>>> alloc_surplus_huge_page+0xa0/0x168
>>> alloc_huge_page+0x264/0x5b8
>>> hugetlbfs_fallocate+0x294/0x680
>>> vfs_fallocate+0x12c/0x568
>>> ksys_fallocate+0x50/0xa0
>>> __arm64_sys_fallocate+0x28/0x38
>>> invoke_syscall+0x4c/0x110
>>> el0_svc_common.constprop.0+0x68/0x128
>>> do_el0_svc+0x34/0xd0
>>> el0_svc+0x48/0xb8
>>> el0t_64_sync_handler+0xb8/0xc0
>>> el0t_64_sync+0x18c/0x190
>>> ---[ end trace 0000000000000000 ]---
>>> 
>>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>> 
>> Actually, this commit does not pose the issue for arm64. I think the correct commit
>> which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
>> 
>> Thanks.
>> 
>>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>>> ---
>>> mm/hugetlb_vmemmap.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>> 
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index ba2a2596fb4e..cb056265c31e 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>> struct page *page = pte_page(*pte);
>>> 
>>> list_add_tail(&page->lru, walk->vmemmap_pages);
>>> - set_pte_at(&init_mm, addr, pte, entry);
>>> + set_pte(pte, entry);
>>> }
>>> 
>>> /*
>>> -- 
>>> 2.25.1




* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-26  7:01     ` Muchun Song
@ 2022-10-26  8:36       ` Anshuman Khandual
  2022-10-26  8:53         ` Muchun Song
  2022-10-27 10:50         ` Catalin Marinas
  0 siblings, 2 replies; 14+ messages in thread
From: Anshuman Khandual @ 2022-10-26  8:36 UTC (permalink / raw)
  To: Muchun Song
  Cc: Wupeng Ma, Andrew Morton, Mike Kravetz, Muchun Song,
	Michal Hocko, Oscar Salvador, catalin.marinas,
	Linux Memory Management List, linux-kernel



On 10/26/22 12:31, Muchun Song wrote:
> 
> 
>> On Oct 26, 2022, at 13:06, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>>
>>
>>
>> On 10/25/22 12:06, Muchun Song wrote:
>>>
>>>
>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>>
>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>
>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>>> read-only to catch illegal write operation to the tail page.
>>>>
>>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>>>
>>> Thanks for your finding this issue.
>>>
>>>> since this may lead to dirty state cleaned. This check is introduced by
>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>>> access and dirty pte bits") and the initial check is as follow:
>>>>
>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>>
>>>> Since we do need to mark this pte as read-only to catch illegal write
>>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>>> this check.
>>>
>>> In theory, the waring does not affect anything since the tail vmemmap
>>> pages are supposed to be read-only. So, skipping this check for vmemmap
>>
>> Tails vmemmap pages are supposed to be read-only, in practice but their
>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
>> warning would not have triggered.
> 
> Right.
> 
>>
>>        VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>>                     "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>>                     __func__, pte_val(old_pte), pte_val(pte));
>>
>> Also, is not it true that the pte being remapped into a different page
>> as read only, than what it had originally (which will be freed up) i.e 
>> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
> 
> Right.
> 
>> a possibility for a race condition even when the PFN changes ?
> 
> Sorry, I didn't get this question. Did you mean the PTE is changed from
> new (pte) to the old one (old_pte) by the hardware because of the update
> of dirty bit when a concurrent write operation to the tail vmemmap page?

No, but doesn't vmemmap_remap_pte() reuse walk->reuse_page for all of the
remaining tail pages? Isn't there a PFN change, along with an access
permission change, involved in this remapping process?
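
For reference, the helper in question (mm/hugetlb_vmemmap.c as of 6.1-rc1,
a lightly trimmed sketch) installs an entry that points at walk->reuse_page
with read-only permissions, so both the PFN and the permissions change:

  static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
                                struct vmemmap_remap_walk *walk)
  {
          /*
           * Remap the tail pages as read-only to catch illegal write
           * operation to the tail pages.
           */
          pgprot_t pgprot = PAGE_KERNEL_RO;
          pte_t entry = mk_pte(walk->reuse_page, pgprot); /* new PFN, RO */
          struct page *page = pte_page(*pte);     /* old page, to be freed */

          list_add_tail(&page->lru, walk->vmemmap_pages);
          set_pte_at(&init_mm, addr, pte, entry); /* PFN + permission change */
  }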


* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-26  8:36       ` Anshuman Khandual
@ 2022-10-26  8:53         ` Muchun Song
  2022-10-27 10:50         ` Catalin Marinas
  1 sibling, 0 replies; 14+ messages in thread
From: Muchun Song @ 2022-10-26  8:53 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Wupeng Ma, Andrew Morton, Mike Kravetz, Muchun Song,
	Michal Hocko, Oscar Salvador, catalin.marinas,
	Linux Memory Management List, linux-kernel



> On Oct 26, 2022, at 16:36, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
> 
> 
> 
> On 10/26/22 12:31, Muchun Song wrote:
>> 
>> 
>>> On Oct 26, 2022, at 13:06, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>>> 
>>> 
>>> 
>>> On 10/25/22 12:06, Muchun Song wrote:
>>>> 
>>>> 
>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>>> 
>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>> 
>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>>>> read-only to catch illegal write operation to the tail page.
>>>>> 
>>>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>>>> 
>>>> Thanks for your finding this issue.
>>>> 
>>>>> since this may lead to dirty state cleaned. This check is introduced by
>>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>>>> access and dirty pte bits") and the initial check is as follow:
>>>>> 
>>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>>> 
>>>>> Since we do need to mark this pte as read-only to catch illegal write
>>>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>>>> this check.
>>>> 
>>>> In theory, the waring does not affect anything since the tail vmemmap
>>>> pages are supposed to be read-only. So, skipping this check for vmemmap
>>> 
>>> Tails vmemmap pages are supposed to be read-only, in practice but their
>>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
>>> warning would not have triggered.
>> 
>> Right.
>> 
>>> 
>>>       VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>>>                    "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>>>                    __func__, pte_val(old_pte), pte_val(pte));
>>> 
>>> Also, is not it true that the pte being remapped into a different page
>>> as read only, than what it had originally (which will be freed up) i.e 
>>> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
>> 
>> Right.
>> 
>>> a possibility for a race condition even when the PFN changes ?
>> 
>> Sorry, I didn't get this question. Did you mean the PTE is changed from
>> new (pte) to the old one (old_pte) by the hardware because of the update
>> of dirty bit when a concurrent write operation to the tail vmemmap page?
> 
> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining
> tails pages ? Is not there a PFN change, along with access permission change
> involved in this remapping process ?

Alright, yes, both the PFN and the access permission are changed.




* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-26  5:06   ` Anshuman Khandual
  2022-10-26  7:01     ` Muchun Song
@ 2022-10-27  1:42     ` mawupeng
  1 sibling, 0 replies; 14+ messages in thread
From: mawupeng @ 2022-10-27  1:42 UTC (permalink / raw)
  To: anshuman.khandual, muchun.song, catalin.marinas
  Cc: mawupeng1, akpm, mike.kravetz, songmuchun, mhocko, osalvador,
	linux-mm, linux-kernel



On 2022/10/26 13:06, Anshuman Khandual wrote:
> 
> 
> On 10/25/22 12:06, Muchun Song wrote:
>>
>>
>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>
>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>
>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>> read-only to catch illegal write operation to the tail page.
>>>
>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>>
>> Thanks for your finding this issue.
>>
>>> since this may lead to dirty state cleaned. This check is introduced by
>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>> access and dirty pte bits") and the initial check is as follow:
>>>
>>>  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>
>>> Since we do need to mark this pte as read-only to catch illegal write
>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>> this check.
>>
>> In theory, the waring does not affect anything since the tail vmemmap
>> pages are supposed to be read-only. So, skipping this check for vmemmap
> 
> Tails vmemmap pages are supposed to be read-only, in practice but their
> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
> warning would not have triggered.
> 
>         VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>                      "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>                      __func__, pte_val(old_pte), pte_val(pte));

So arm64 will trigger this WARN_ON under this condition, and the condition
is not unusual, as this scenario shows?

It is true that we should change the logic on arm64, but how should we change
it to make the code more generic?

Any thoughts?

Thanks for reviewing.

> 
> Also, is not it true that the pte being remapped into a different page
> as read only, than what it had originally (which will be freed up) i.e 
> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
> a possibility for a race condition even when the PFN changes ?
> 
>> pages seem feasible. But I am not sure whether it is general to use
>> set_pte here instead of set_pte_at, I didn’t see any users of set_pte
>> from the common code routines except the code from arch/xxx. And this
>> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update()
>> itself.
> 
> Right, should not change it to yet lower level platform helper set_pte()
> just to work around this warning. Instead, __check_racy_pte_update() is
> the right place where it should be fixed.
> 
>>
>> Something like (Just some thoughts from mine):
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index b5df82aa99e6..df7716965a93 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>>   *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>>   */
>>
>> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>> +static inline void __check_racy_pte_update(struct mm_struct *mm,
>> +                                          unsigned long addr, pte_t *ptep,
>>                                            pte_t pte)
>>  {
>>         pte_t old_pte;
>> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>>         if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>>                 return;
>>
>> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
>> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
>> +               return;
>> +
>>         /*
>>          * Check for potential race with hardware updates of the pte
>>          * (ptep_set_access_flags safely changes valid ptes without going
>>
>>>
>>> The following shell command can be used to reproduce this WARN_ON in
>>> 6.1.0-rc1:
>>>
>>>  echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>>>  cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>>
>>>  echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>>>  mkdir -p /root/hugetlb
>>>  mount none /root/hugetlb -t hugetlbfs
>>>  fallocate -l 2g /root/hugetlb/xx &
>>>
>>> Here is the detail WARN_ON log:
>>>
>>> ------------[ cut here ]------------
>>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>>> Modules linked in:
>>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>> Call trace:
>>> vmemmap_remap_pte+0x118/0x120
>>> vmemmap_remap_range+0x30c/0x560
>>> hugetlb_vmemmap_optimize+0x158/0x408
>>> __prep_new_huge_page+0x24/0x150
>>> prep_new_huge_page+0x30/0x60
>>> alloc_fresh_huge_page+0x1c4/0x1e0
>>> alloc_surplus_huge_page+0xa0/0x168
>>> alloc_huge_page+0x264/0x5b8
>>> hugetlbfs_fallocate+0x294/0x680
>>> vfs_fallocate+0x12c/0x568
>>> ksys_fallocate+0x50/0xa0
>>> __arm64_sys_fallocate+0x28/0x38
>>> invoke_syscall+0x4c/0x110
>>> el0_svc_common.constprop.0+0x68/0x128
>>> do_el0_svc+0x34/0xd0
>>> el0_svc+0x48/0xb8
>>> el0t_64_sync_handler+0xb8/0xc0
>>> el0t_64_sync+0x18c/0x190
>>> ---[ end trace 0000000000000000 ]---
>>>
>>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>>
>> Actually, this commit does not pose the issue for arm64. I think the correct commit
>> which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
>>
>> Thanks.
>>
>>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>>> ---
>>> mm/hugetlb_vmemmap.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index ba2a2596fb4e..cb056265c31e 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>> struct page *page = pte_page(*pte);
>>>
>>> list_add_tail(&page->lru, walk->vmemmap_pages);
>>> - set_pte_at(&init_mm, addr, pte, entry);
>>> + set_pte(pte, entry);
>>> }
>>>
>>> /*
>>> -- 
>>> 2.25.1
>>>
>>>
>>
>>
> 


* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-26  8:36       ` Anshuman Khandual
  2022-10-26  8:53         ` Muchun Song
@ 2022-10-27 10:50         ` Catalin Marinas
  2022-10-28  2:45           ` Muchun Song
  1 sibling, 1 reply; 14+ messages in thread
From: Catalin Marinas @ 2022-10-27 10:50 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Muchun Song, Wupeng Ma, Andrew Morton, Mike Kravetz, Muchun Song,
	Michal Hocko, Oscar Salvador, Linux Memory Management List,
	linux-kernel

On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
> On 10/26/22 12:31, Muchun Song wrote:
> >> On 10/25/22 12:06, Muchun Song wrote:
> >>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
> >>>> From: Ma Wupeng <mawupeng1@huawei.com>
> >>>>
> >>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
> >>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
> >>>> read-only to catch illegal write operation to the tail page.
> >>>>
> >>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
> >>>
> >>> Thanks for your finding this issue.
> >>>
> >>>> since this may lead to dirty state cleaned. This check is introduced by
> >>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
> >>>> access and dirty pte bits") and the initial check is as follow:
> >>>>
> >>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
> >>>>
> >>>> Since we do need to mark this pte as read-only to catch illegal write
> >>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
> >>>> this check.
> >>>
> >>> In theory, the waring does not affect anything since the tail vmemmap
> >>> pages are supposed to be read-only. So, skipping this check for vmemmap
> >>
> >> Tails vmemmap pages are supposed to be read-only, in practice but their
> >> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
> >> warning would not have triggered.
> > 
> > Right.
> > 
> >>
> >>        VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
> >>                     "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
> >>                     __func__, pte_val(old_pte), pte_val(pte));
> >>
> >> Also, is not it true that the pte being remapped into a different page
> >> as read only, than what it had originally (which will be freed up) i.e 
> >> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
> > 
> > Right.
> > 
> >> a possibility for a race condition even when the PFN changes ?
> > 
> > Sorry, I didn't get this question. Did you mean the PTE is changed from
> > new (pte) to the old one (old_pte) by the hardware because of the update
> > of dirty bit when a concurrent write operation to the tail vmemmap page?
> 
> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining
> tails pages ? Is not there a PFN change, along with access permission change
> involved in this remapping process ?

For the record, as we discussed offline, changing the output address
(pfn) of a pte is not safe without break-before-make if at least one of
the mappings was writeable. The caller (vmemmap_remap_pte()) would need
to be fixed to first invalidate the pte and then write the new pte. I
assume no other CPU accesses this part of the vmemmap while the pte is
being remapped.
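
A minimal sketch of such a break-before-make sequence in vmemmap_remap_pte()
(with 'entry' as built there) might look like this -- hypothetical and
untested, and it assumes no concurrent accesses to this part of the vmemmap,
which the rest of the thread questions:

  /* hypothetical break-before-make variant of the remap, untested */
  pte_clear(&init_mm, addr, pte);                 /* 1. invalidate the entry */
  flush_tlb_kernel_range(addr, addr + PAGE_SIZE); /* 2. flush the stale TLB entry */
  set_pte_at(&init_mm, addr, pte, entry);         /* 3. install the new mapping */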

-- 
Catalin


* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-27 10:50         ` Catalin Marinas
@ 2022-10-28  2:45           ` Muchun Song
  2022-10-28 15:53             ` Catalin Marinas
  0 siblings, 1 reply; 14+ messages in thread
From: Muchun Song @ 2022-10-28  2:45 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Anshuman Khandual, Wupeng Ma, Andrew Morton, Mike Kravetz,
	Muchun Song, Michal Hocko, Oscar Salvador,
	Linux Memory Management List, linux-kernel



> On Oct 27, 2022, at 18:50, Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
>> On 10/26/22 12:31, Muchun Song wrote:
>>>> On 10/25/22 12:06, Muchun Song wrote:
>>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>>> 
>>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>>>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>>>>> read-only to catch illegal write operation to the tail page.
>>>>>> 
>>>>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>>>>> 
>>>>> Thanks for your finding this issue.
>>>>> 
>>>>>> since this may lead to dirty state cleaned. This check is introduced by
>>>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>>>>> access and dirty pte bits") and the initial check is as follow:
>>>>>> 
>>>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>>>> 
>>>>>> Since we do need to mark this pte as read-only to catch illegal write
>>>>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>>>>> this check.
>>>>> 
>>>>> In theory, the waring does not affect anything since the tail vmemmap
>>>>> pages are supposed to be read-only. So, skipping this check for vmemmap
>>>> 
>>>> Tails vmemmap pages are supposed to be read-only, in practice but their
>>>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
>>>> warning would not have triggered.
>>> 
>>> Right.
>>> 
>>>> 
>>>>       VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>>>>                    "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>>>>                    __func__, pte_val(old_pte), pte_val(pte));
>>>> 
>>>> Also, is not it true that the pte being remapped into a different page
>>>> as read only, than what it had originally (which will be freed up) i.e 
>>>> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
>>> 
>>> Right.
>>> 
>>>> a possibility for a race condition even when the PFN changes ?
>>> 
>>> Sorry, I didn't get this question. Did you mean the PTE is changed from
>>> new (pte) to the old one (old_pte) by the hardware because of the update
>>> of dirty bit when a concurrent write operation to the tail vmemmap page?
>> 
>> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining
>> tails pages ? Is not there a PFN change, along with access permission change
>> involved in this remapping process ?
> 
> For the record, as we discussed offline, changing the output address
> (pfn) of a pte is not safe without break-before-make if at least one of
> the mappings was writeable. The caller (vmemmap_remap_pte()) would need
> to be fixed to first invalidate the pte and then write the new pte. I

Hi Catalin,

Could you give more details about what issue this would cause? I am not
familiar with arm64.

> assume no other CPU accesses this part of the vmemmap while the pte is
> being remapped.

However, there is no guarantee that no other CPU accesses this pte. E.g.
memory failure and memory compaction can both obtain the head page from any
tail struct page (a read-only access) at any time.
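
For instance, compound_head() only reads the tail page's struct page and can
run at any time on another CPU; roughly (a sketch close to the actual
include/linux/page-flags.h helper):

  static inline struct page *compound_head(const struct page *page)
  {
          unsigned long head = READ_ONCE(page->compound_head);

          /* bit 0 set means "tail page"; the rest is the head page pointer */
          if (unlikely(head & 1))
                  return (struct page *)(head - 1);
          return (struct page *)page;
  }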

Thanks.

> 
> -- 
> Catalin




* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-28  2:45           ` Muchun Song
@ 2022-10-28 15:53             ` Catalin Marinas
  2022-11-01  9:29               ` Muchun Song
  0 siblings, 1 reply; 14+ messages in thread
From: Catalin Marinas @ 2022-10-28 15:53 UTC (permalink / raw)
  To: Muchun Song
  Cc: Anshuman Khandual, Wupeng Ma, Andrew Morton, Mike Kravetz,
	Muchun Song, Michal Hocko, Oscar Salvador,
	Linux Memory Management List, linux-kernel

On Fri, Oct 28, 2022 at 10:45:09AM +0800, Muchun Song wrote:
> On Oct 27, 2022, at 18:50, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
> >> On 10/26/22 12:31, Muchun Song wrote:
> >>>> On 10/25/22 12:06, Muchun Song wrote:
> >>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
> >>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
> >>>>>> 
> >>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
> >>>>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
> >>>>>> read-only to catch illegal write operation to the tail page.
> >>>>>> 
> >>>>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
> >>>>> 
> >>>>> Thanks for your finding this issue.
> >>>>> 
> >>>>>> since this may lead to dirty state cleaned. This check is introduced by
> >>>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
> >>>>>> access and dirty pte bits") and the initial check is as follow:
> >>>>>> 
> >>>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
> >>>>>> 
> >>>>>> Since we do need to mark this pte as read-only to catch illegal write
> >>>>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
> >>>>>> this check.
> >>>>> 
> >>>>> In theory, the waring does not affect anything since the tail vmemmap
> >>>>> pages are supposed to be read-only. So, skipping this check for vmemmap
> >>>> 
> >>>> Tails vmemmap pages are supposed to be read-only, in practice but their
> >>>> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
> >>>> warning would not have triggered.
> >>> 
> >>> Right.
> >>> 
> >>>> 
> >>>>       VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
> >>>>                    "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
> >>>>                    __func__, pte_val(old_pte), pte_val(pte));
> >>>> 
> >>>> Also, is not it true that the pte being remapped into a different page
> >>>> as read only, than what it had originally (which will be freed up) i.e 
> >>>> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
> >>> 
> >>> Right.
> >>> 
> >>>> a possibility for a race condition even when the PFN changes ?
> >>> 
> >>> Sorry, I didn't get this question. Did you mean the PTE is changed from
> >>> new (pte) to the old one (old_pte) by the hardware because of the update
> >>> of dirty bit when a concurrent write operation to the tail vmemmap page?
> >> 
> >> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining
> >> tails pages ? Is not there a PFN change, along with access permission change
> >> involved in this remapping process ?
> > 
> > For the record, as we discussed offline, changing the output address
> > (pfn) of a pte is not safe without break-before-make if at least one of
> > the mappings was writeable. The caller (vmemmap_remap_pte()) would need
> > to be fixed to first invalidate the pte and then write the new pte. I
> 
> Could you expose more details about what issue it will be caused? I am
> not familiar with arm64.

Well, it's not allowed by the architecture, so some CPU implementations
may do weird things like accessing incorrect memory or triggering TLB
conflict aborts if, for some reason, they end up with two entries in
the TLB for the same VA but pointing to different pfns. The hardware
expects an invalid PTE and TLB invalidation between such changes. In
practice most likely nothing happens and this works fine but we need to
stick to the architecture requirements in case some CPUs take advantage
of this requirement.

> > assume no other CPU accesses this part of the vmemmap while the pte is
> > being remapped.
> 
> However, there is no guarantee that no other CPU accesses this pte.
> E.g. memory failure or memory compaction, both can obtain head page
> from any tail struct pages (only read) anytime.

Oh, so we cannot safely go through a break-before-make sequence here
(zero the PTE, flush the TLB, write the new PTE) as some CPU may access
this pte.

-- 
Catalin


* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-26  3:01   ` mawupeng
@ 2022-10-29  1:55     ` mawupeng
  2022-11-01  9:56       ` Catalin Marinas
  0 siblings, 1 reply; 14+ messages in thread
From: mawupeng @ 2022-10-29  1:55 UTC (permalink / raw)
  To: muchun.song, catalin.marinas, songmuchun, anshuman.khandual
  Cc: mawupeng1, akpm, mike.kravetz, mhocko, osalvador, linux-mm, linux-kernel



On 2022/10/26 11:01, mawupeng wrote:
> 
> 
> On 2022/10/25 14:36, Muchun Song wrote:
>>
>>
>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>
>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>
>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>> each HugeTLB page") add vmemmap_remap_pte to remap the tail pages as
>>> read-only to catch illegal write operation to the tail page.
>>>
>>> However this will lead to WARN_ON in arm64 in __check_racy_pte_update()
>>
>> Thanks for your finding this issue.
>>
>>> since this may lead to dirty state cleaned. This check is introduced by
>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>> access and dirty pte bits") and the initial check is as follow:
>>>
>>>  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>
>>> Since we do need to mark this pte as read-only to catch illegal write
>>> operation to the tail pages, use set_pte  to replace set_pte_at to bypass
>>> this check.
>>
>> In theory, the waring does not affect anything since the tail vmemmap
>> pages are supposed to be read-only. So, skipping this check for vmemmap
>> pages seem feasible. But I am not sure whether it is general to use
>> set_pte here instead of set_pte_at, I didn’t see any users of set_pte
>> from the common code routines except the code from arch/xxx. And this
>> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update()
>> itself.
>>
>> Something like (Just some thoughts from mine):
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index b5df82aa99e6..df7716965a93 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>>   *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>>   */
>>
>> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>> +static inline void __check_racy_pte_update(struct mm_struct *mm,
>> +                                          unsigned long addr, pte_t *ptep,
>>                                            pte_t pte)
>>  {
>>         pte_t old_pte;
>> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>>         if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>>                 return;
>>
>> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
>> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
>> +               return;
>> +
>>         /*
>>          * Check for potential race with hardware updates of the pte
>>          * (ptep_set_access_flags safely changes valid ptes without going
>>

IMHO, arm64 (and possibly other archs) has done some work on the dirty and
rdonly bits of the pte since commit 2f4b829c625e ("arm64: Add support for
hardware updates of the access and dirty pte bits").

Maybe we can use pte_wrprotect() to mark this pte read-only? Compared to the
old entry, it will set the PTE_DIRTY bit on the new pte entry.

Here is the diff:

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index ba2a2596fb4e..24a230895316 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -244,8 +244,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
         * Remap the tail pages as read-only to catch illegal write operation
         * to the tail pages.
         */
-       pgprot_t pgprot = PAGE_KERNEL_RO;
-       pte_t entry = mk_pte(walk->reuse_page, pgprot);
+       pte_t entry = pte_wrprotect(mk_pte(walk->reuse_page, PAGE_KERNEL));
        struct page *page = pte_page(*pte);
 
        list_add_tail(&page->lru, walk->vmemmap_pages);
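
On arm64, pte_wrprotect() carries any hardware-dirty state over into the
software PTE_DIRTY bit before clearing write permission; a simplified sketch
of the arch/arm64/include/asm/pgtable.h helper (not verbatim) is:

  static inline pte_t pte_wrprotect(pte_t pte)
  {
          /*
           * A hardware-dirty pte (PTE_WRITE set, PTE_RDONLY clear) becomes
           * software-dirty (PTE_DIRTY), so the dirty state is preserved
           * when the entry is made read-only.
           */
          if (pte_hw_dirty(pte))
                  pte = pte_mkdirty(pte);

          pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));
          pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
          return pte;
  }

The resulting entry is pte_dirty(), so the pte_write(old_pte) &&
!pte_dirty(pte) condition in the warning can no longer be true.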

>>>
>>> The following shell command can be used to reproduce this WARN_ON in
>>> 6.1.0-rc1:
>>>
>>>  echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>>>  cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>>
>>>  echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>>>  mkdir -p /root/hugetlb
>>>  mount none /root/hugetlb -t hugetlbfs
>>>  fallocate -l 2g /root/hugetlb/xx &
>>>
>>> Here is the detail WARN_ON log:
>>>
>>> ------------[ cut here ]------------
>>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>>> Modules linked in:
>>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>> Call trace:
>>> vmemmap_remap_pte+0x118/0x120
>>> vmemmap_remap_range+0x30c/0x560
>>> hugetlb_vmemmap_optimize+0x158/0x408
>>> __prep_new_huge_page+0x24/0x150
>>> prep_new_huge_page+0x30/0x60
>>> alloc_fresh_huge_page+0x1c4/0x1e0
>>> alloc_surplus_huge_page+0xa0/0x168
>>> alloc_huge_page+0x264/0x5b8
>>> hugetlbfs_fallocate+0x294/0x680
>>> vfs_fallocate+0x12c/0x568
>>> ksys_fallocate+0x50/0xa0
>>> __arm64_sys_fallocate+0x28/0x38
>>> invoke_syscall+0x4c/0x110
>>> el0_svc_common.constprop.0+0x68/0x128
>>> do_el0_svc+0x34/0xd0
>>> el0_svc+0x48/0xb8
>>> el0t_64_sync_handler+0xb8/0xc0
>>> el0t_64_sync+0x18c/0x190
>>> ---[ end trace 0000000000000000 ]---
>>>
>>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>>
>> Actually, this commit is not the one that introduced the issue on arm64. I think
>> the correct commit to fix is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
>>
>> Thanks.
> 
> Thanks for pointing it out.
> 
> I have tested that commit; the culprit is indeed 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
> This VM_WARN_ONCE is triggered once HUGETLB_PAGE_FREE_VMEMMAP is enabled for arm64.
> 
> I will update the changelog and fix this in the next version.
> 
>>
>>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>>> ---
>>> mm/hugetlb_vmemmap.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index ba2a2596fb4e..cb056265c31e 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>> struct page *page = pte_page(*pte);
>>>
>>> list_add_tail(&page->lru, walk->vmemmap_pages);
>>> - set_pte_at(&init_mm, addr, pte, entry);
>>> + set_pte(pte, entry);
>>> }
>>>
>>> /*
>>> -- 
>>> 2.25.1
>>>
>>>
>>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-28 15:53             ` Catalin Marinas
@ 2022-11-01  9:29               ` Muchun Song
  0 siblings, 0 replies; 14+ messages in thread
From: Muchun Song @ 2022-11-01  9:29 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Anshuman Khandual, Wupeng Ma, Andrew Morton, Mike Kravetz,
	Muchun Song, Michal Hocko, Oscar Salvador,
	Linux Memory Management List, linux-kernel



> On Oct 28, 2022, at 23:53, Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> On Fri, Oct 28, 2022 at 10:45:09AM +0800, Muchun Song wrote:
>> On Oct 27, 2022, at 18:50, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>> On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
>>>> On 10/26/22 12:31, Muchun Song wrote:
>>>>>> On 10/25/22 12:06, Muchun Song wrote:
>>>>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>>>>> 
>>>>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>>>>>>> each HugeTLB page") added vmemmap_remap_pte to remap the tail pages as
>>>>>>>> read-only to catch illegal write operations to the tail pages.
>>>>>>>> 
>>>>>>>> However, this leads to a WARN_ON on arm64 in __check_racy_pte_update()
>>>>>>> 
>>>>>>> Thanks for finding this issue.
>>>>>>> 
>>>>>>>> since it may result in the dirty state being cleared. This check was
>>>>>>>> introduced by commit 2f4b829c625e ("arm64: Add support for hardware
>>>>>>>> updates of the access and dirty pte bits") and the initial check was as follows:
>>>>>>>> 
>>>>>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>>>>>> 
>>>>>>>> Since we do need to mark this pte as read-only to catch illegal write
>>>>>>>> operations to the tail pages, use set_pte instead of set_pte_at to bypass
>>>>>>>> this check.
>>>>>>> 
>>>>>>> In theory, the warning does not affect anything since the tail vmemmap
>>>>>>> pages are supposed to be read-only. So, skipping this check for vmemmap
>>>>>> 
>>>>>> Tail vmemmap pages are supposed to be read-only in practice, but their
>>>>>> backing ptes do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
>>>>>> warning would not have triggered.
>>>>> 
>>>>> Right.
>>>>> 
>>>>>> 
>>>>>>      VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>>>>>>                   "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>>>>>>                   __func__, pte_val(old_pte), pte_val(pte));
>>>>>> 
>>>>>> Also, is it not true that the pte is being remapped, read-only, to a
>>>>>> different page than the one it mapped originally (which will be freed
>>>>>> up), i.e. the PFN in 'old_pte' and 'pte' will be different? Hence is there still
>>>>> 
>>>>> Right.
>>>>> 
>>>>>> a possibility of a race condition even when the PFN changes?
>>>>> 
>>>>> Sorry, I didn't get this question. Did you mean the PTE is changed from
>>>>> the new one (pte) back to the old one (old_pte) by the hardware because
>>>>> of a dirty-bit update on a concurrent write to the tail vmemmap page?
>>>> 
>>>> No, but does not vmemmap_remap_pte() reuse walk->reuse_page for all remaining
>>>> tail pages? Is there not a PFN change, along with an access permission change,
>>>> involved in this remapping process?
>>> 
>>> For the record, as we discussed offline, changing the output address
>>> (pfn) of a pte is not safe without break-before-make if at least one of
>>> the mappings was writeable. The caller (vmemmap_remap_pte()) would need
>>> to be fixed to first invalidate the pte and then write the new pte. I
>> 
>> Could you explain in more detail what issue it would cause? I am
>> not familiar with arm64.
> 
> Well, it's not allowed by the architecture, so some CPU implementations
> may do weird things like accessing incorrect memory or triggering TLB
> conflict aborts if, for some reason, they end up with two entries in
> the TLB for the same VA but pointing to different pfns. The hardware
> expects an invalid PTE and TLB invalidation between such changes. In
> practice, most likely nothing happens and this works fine, but we need to
> stick to the architecture requirements in case some CPU implementation
> relies on them.
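Spelled out against 6.1-era kernel APIs, the architecturally required
break-before-make sequence would look roughly like the sketch below
(hypothetical code; as the thread concludes next, it cannot actually be
used here because other CPUs may access the vmemmap concurrently):

/* Hypothetical BBM variant of vmemmap_remap_pte(); NOT usable as-is. */
static void vmemmap_remap_pte_bbm(pte_t *pte, unsigned long addr,
				  struct vmemmap_remap_walk *walk)
{
	pte_t entry = pte_wrprotect(mk_pte(walk->reuse_page, PAGE_KERNEL));
	struct page *page = pte_page(*pte);

	list_add_tail(&page->lru, walk->vmemmap_pages);

	pte_clear(&init_mm, addr, pte);				/* break */
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);		/* invalidate */
	set_pte_at(&init_mm, addr, pte, entry);			/* make */
}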

Got it. Thanks for your nice explanation.

> 
>>> assume no other CPU accesses this part of the vmemmap while the pte is
>>> being remapped.
>> 
>> However, there is no guarantee that no other CPU accesses this pte.
>> E.g. memory failure and memory compaction can both obtain the head page
>> from any tail struct page (read-only) at any time.
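For instance, compound_head() can legitimately be called on any tail
struct page at any time. A simplified sketch of that read path (the real
6.1 version in include/linux/page-flags.h also handles the "fake head"
pages introduced by this very optimization):

/* Simplified; bit 0 of page->compound_head flags a tail page. */
static inline struct page *compound_head_sketch(const struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);

	if (head & 1)
		return (struct page *)(head - 1);	/* tail: follow head pointer */
	return (struct page *)page;			/* already a head page */
}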
> 
> Oh, so we cannot safely go through a break-before-make sequence here
> (zero the PTE, flush the TLB, write the new PTE) as some CPU may access
> this pte.

Right.

Muchun

> 
> -- 
> Catalin



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
  2022-10-29  1:55     ` mawupeng
@ 2022-11-01  9:56       ` Catalin Marinas
  0 siblings, 0 replies; 14+ messages in thread
From: Catalin Marinas @ 2022-11-01  9:56 UTC (permalink / raw)
  To: mawupeng
  Cc: muchun.song, songmuchun, anshuman.khandual, akpm, mike.kravetz,
	mhocko, osalvador, linux-mm, linux-kernel

On Sat, Oct 29, 2022 at 09:55:15AM +0800, mawupeng wrote:
> On 2022/10/26 11:01, mawupeng wrote:
> > On 2022/10/25 14:36, Muchun Song wrote:
> >>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
> >>>
> >>> From: Ma Wupeng <mawupeng1@huawei.com>
> >>>
> >>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
> >>> each HugeTLB page") added vmemmap_remap_pte to remap the tail pages as
> >>> read-only to catch illegal write operations to the tail pages.
> >>>
> >>> However, this leads to a WARN_ON on arm64 in __check_racy_pte_update()
> >>
> >> Thanks for finding this issue.
> >>
> >>> since it may result in the dirty state being cleared. This check was
> >>> introduced by commit 2f4b829c625e ("arm64: Add support for hardware
> >>> updates of the access and dirty pte bits") and the initial check was as follows:
> >>>
> >>>  BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
> >>>
> >>> Since we do need to mark this pte as read-only to catch illegal write
> >>> operations to the tail pages, use set_pte instead of set_pte_at to bypass
> >>> this check.
[...]
> IMHO, arm64 (and possibly other arches) has done extra bookkeeping on the
> dirty and rdonly bits of the pte since commit 2f4b829c625e ("arm64: Add
> support for hardware updates of the access and dirty pte bits").
> 
> Maybe we can use pte_wrprotect() to mark this pte read-only? Compared to
> the old entry, it sets the PTE_DIRTY bit on the new one.
> 
> Here is the diff:
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index ba2a2596fb4e..24a230895316 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -244,8 +244,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>          * Remap the tail pages as read-only to catch illegal write operation
>          * to the tail pages.
>          */
> -       pgprot_t pgprot = PAGE_KERNEL_RO;
> -       pte_t entry = mk_pte(walk->reuse_page, pgprot);
> +       pte_t entry = pte_wrprotect(mk_pte(walk->reuse_page, PAGE_KERNEL));

This may silence the warning, but we plan to add another one to detect a
pfn change that does not go through a break-before-make sequence.

-- 
Catalin
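A rough sketch of the kind of pfn-change check described above, as an
addition to __check_racy_pte_update() (an illustration of the idea, not
the patch that eventually landed):

	/* Hypothetical: warn when a live pte is repointed at a new pfn
	 * without going through break-before-make. */
	VM_WARN_ONCE(pte_valid(old_pte) && pte_valid(pte) &&
		     pte_pfn(old_pte) != pte_pfn(pte),
		     "%s: unsafe pfn change: 0x%016llx -> 0x%016llx",
		     __func__, pte_val(old_pte), pte_val(pte));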

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-11-01  9:56 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-25  1:42 [PATCH -next 1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte Wupeng Ma
2022-10-25  6:36 ` Muchun Song
2022-10-26  3:01   ` mawupeng
2022-10-29  1:55     ` mawupeng
2022-11-01  9:56       ` Catalin Marinas
2022-10-26  5:06   ` Anshuman Khandual
2022-10-26  7:01     ` Muchun Song
2022-10-26  8:36       ` Anshuman Khandual
2022-10-26  8:53         ` Muchun Song
2022-10-27 10:50         ` Catalin Marinas
2022-10-28  2:45           ` Muchun Song
2022-10-28 15:53             ` Catalin Marinas
2022-11-01  9:29               ` Muchun Song
2022-10-27  1:42     ` mawupeng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).