* [PATCH 0/6] A few fixup patches for hugetlb
@ 2022-08-16 13:05 Miaohe Lin
  2022-08-16 13:05 ` [PATCH 1/6] mm/hugetlb: fix incorrect update of max_huge_pages Miaohe Lin
                   ` (5 more replies)
  0 siblings, 6 replies; 44+ messages in thread
From: Miaohe Lin @ 2022-08-16 13:05 UTC (permalink / raw)
  To: akpm, mike.kravetz, songmuchun; +Cc: linux-mm, linux-kernel, linmiaohe

Hi everyone,
This series contains a few fixup patches to fix an incorrect update of
max_huge_pages, a WARN_ON(!kobj) in sysfs_create_group(), and so on.
More details can be found in the respective changelogs.
Thanks!

Miaohe Lin (6):
  mm/hugetlb: fix incorrect update of max_huge_pages
  mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group()
  mm/hugetlb: fix missing call to restore_reserve_on_error()
  mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node()
  mm/hugetlb: make detecting shared pte more reliable

 mm/hugetlb.c         | 46 +++++++++++++++++++++++++++-----------------
 mm/hugetlb_vmemmap.c |  5 +++++
 2 files changed, 33 insertions(+), 18 deletions(-)

-- 
2.23.0



* [PATCH 1/6] mm/hugetlb: fix incorrect update of max_huge_pages
  2022-08-16 13:05 [PATCH 0/6] A few fixup patches for hugetlb Miaohe Lin
@ 2022-08-16 13:05 ` Miaohe Lin
  2022-08-16 22:52   ` Mike Kravetz
  2022-08-17  2:28   ` Muchun Song
  2022-08-16 13:05 ` [PATCH 2/6] mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group() Miaohe Lin
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 44+ messages in thread
From: Miaohe Lin @ 2022-08-16 13:05 UTC (permalink / raw)
  To: akpm, mike.kravetz, songmuchun; +Cc: linux-mm, linux-kernel, linmiaohe

When a page is demoted, target_hstate->max_huge_pages should be
incremented by pages_per_huge_page(h) / pages_per_huge_page(target_hstate)
pages, not by pages_per_huge_page(h). Update max_huge_pages accordingly
for consistency.
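
For illustration (hypothetical numbers, assuming the common x86-64 sizes
of 1GB source and 2MB target pages): demoting one 1GB page
(pages_per_huge_page(h) == 262144) into 2MB pages
(pages_per_huge_page(target_hstate) == 512) yields 262144 / 512 == 512
target pages, so target_hstate->max_huge_pages should grow by 512, not by
262144 as the old code computed.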

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ea1c7bfa1cc3..e72052964fb5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3472,7 +3472,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
 	 * based on pool changes for the demoted page.
 	 */
 	h->max_huge_pages--;
-	target_hstate->max_huge_pages += pages_per_huge_page(h);
+	target_hstate->max_huge_pages +=
+		pages_per_huge_page(h) / pages_per_huge_page(target_hstate);
 
 	return rc;
 }
-- 
2.23.0



* [PATCH 2/6] mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group()
  2022-08-16 13:05 [PATCH 0/6] A few fixup patches for hugetlb Miaohe Lin
  2022-08-16 13:05 ` [PATCH 1/6] mm/hugetlb: fix incorrect update of max_huge_pages Miaohe Lin
@ 2022-08-16 13:05 ` Miaohe Lin
  2022-08-16 22:55   ` Mike Kravetz
  2022-08-17  2:31   ` Muchun Song
  2022-08-16 13:05 ` [PATCH 3/6] mm/hugetlb: fix missing call to restore_reserve_on_error() Miaohe Lin
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 44+ messages in thread
From: Miaohe Lin @ 2022-08-16 13:05 UTC (permalink / raw)
  To: akpm, mike.kravetz, songmuchun; +Cc: linux-mm, linux-kernel, linmiaohe

If sysfs_create_group() fails with hstate_attr_group, hstate_kobjs[hi]
will be set to NULL. The NULL pointer will then be passed to
sysfs_create_group() if h->demote_order != 0, triggering the
WARN_ON(!kobj) check. Fix this by making sure hstate_kobjs[hi] != NULL
when calling sysfs_create_group().
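
For reference, sysfs_create_group() rejects a NULL kobject with a check
along these lines (paraphrased from fs/sysfs/group.c; details may vary by
kernel version):

	if (WARN_ON(!kobj || (!update && !kobj->sd)))
		return -EINVAL;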

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e72052964fb5..ff991e5bdf1f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3846,6 +3846,7 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
 	if (retval) {
 		kobject_put(hstate_kobjs[hi]);
 		hstate_kobjs[hi] = NULL;
+		return retval;
 	}
 
 	if (h->demote_order) {
-- 
2.23.0



* [PATCH 3/6] mm/hugetlb: fix missing call to restore_reserve_on_error()
  2022-08-16 13:05 [PATCH 0/6] A few fixup patches for hugetlb Miaohe Lin
  2022-08-16 13:05 ` [PATCH 1/6] mm/hugetlb: fix incorrect update of max_huge_pages Miaohe Lin
  2022-08-16 13:05 ` [PATCH 2/6] mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group() Miaohe Lin
@ 2022-08-16 13:05 ` Miaohe Lin
  2022-08-16 23:31   ` Mike Kravetz
  2022-08-16 13:05 ` [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at() Miaohe Lin
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 44+ messages in thread
From: Miaohe Lin @ 2022-08-16 13:05 UTC (permalink / raw)
  To: akpm, mike.kravetz, songmuchun; +Cc: linux-mm, linux-kernel, linmiaohe

When huge_add_to_page_cache() fails, the page is freed directly without
calling restore_reserve_on_error() to restore the reserve for newly
allocated pages not in the page cache. Fix this by calling
restore_reserve_on_error() when huge_add_to_page_cache() fails.
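
A simplified sketch of the resulting error path in hugetlb_no_page()
(paraphrased, not the exact upstream code):

	page = alloc_huge_page(vma, haddr, 0);	/* may consume a reservation */
	...
	err = huge_add_to_page_cache(page, mapping, idx);
	if (err) {
		/* return the consumed reservation before freeing the page */
		restore_reserve_on_error(h, vma, haddr, page);
		put_page(page);
		...
	}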

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ff991e5bdf1f..b69d7808f457 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5603,6 +5603,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		if (vma->vm_flags & VM_MAYSHARE) {
 			int err = huge_add_to_page_cache(page, mapping, idx);
 			if (err) {
+				restore_reserve_on_error(h, vma, haddr, page);
 				put_page(page);
 				if (err == -EEXIST)
 					goto retry;
-- 
2.23.0



* [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-16 13:05 [PATCH 0/6] A few fixup patches for hugetlb Miaohe Lin
                   ` (2 preceding siblings ...)
  2022-08-16 13:05 ` [PATCH 3/6] mm/hugetlb: fix missing call to restore_reserve_on_error() Miaohe Lin
@ 2022-08-16 13:05 ` Miaohe Lin
  2022-08-17  2:53   ` Muchun Song
                     ` (2 more replies)
  2022-08-16 13:05 ` [PATCH 5/6] mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node() Miaohe Lin
  2022-08-16 13:05 ` [PATCH 6/6] mm/hugetlb: make detecting shared pte more reliable Miaohe Lin
  5 siblings, 3 replies; 44+ messages in thread
From: Miaohe Lin @ 2022-08-16 13:05 UTC (permalink / raw)
  To: akpm, mike.kravetz, songmuchun; +Cc: linux-mm, linux-kernel, linmiaohe

The memory barrier smp_wmb() is needed to make sure that preceding stores
to the page contents become visible before the below set_pte_at() write.
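
A minimal sketch of the ordering in question (illustrative only, not
taken verbatim from the kernel source):

	/* writer: vmemmap_restore_pte() */
	copy_page(to, (void *)walk->reuse_addr);	/* W1: page contents */
	reset_struct_pages(to);				/* W2: page contents */
	smp_wmb();					/* order W1/W2 before W3 */
	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));	/* W3: publish */

Without the barrier, a weakly ordered CPU may make W3 visible before W1/W2,
so a reader that loads the new pte could observe stale struct pages through
it.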

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb_vmemmap.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 20f414c0379f..76b2d03a0d8d 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -287,6 +287,11 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 	copy_page(to, (void *)walk->reuse_addr);
 	reset_struct_pages(to);
 
+	/*
+	 * Makes sure that preceding stores to the page contents become visible
+	 * before the set_pte_at() write.
+	 */
+	smp_wmb();
 	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
 }
 
-- 
2.23.0



* [PATCH 5/6] mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node()
  2022-08-16 13:05 [PATCH 0/6] A few fixup patches for hugetlb Miaohe Lin
                   ` (3 preceding siblings ...)
  2022-08-16 13:05 ` [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at() Miaohe Lin
@ 2022-08-16 13:05 ` Miaohe Lin
  2022-08-17  9:41   ` Yin, Fengwei
  2022-08-18  1:12   ` Yin, Fengwei
  2022-08-16 13:05 ` [PATCH 6/6] mm/hugetlb: make detecting shared pte more reliable Miaohe Lin
  5 siblings, 2 replies; 44+ messages in thread
From: Miaohe Lin @ 2022-08-16 13:05 UTC (permalink / raw)
  To: akpm, mike.kravetz, songmuchun; +Cc: linux-mm, linux-kernel, linmiaohe

The sysfs groups per_node_hstate_attr_group, and hstate_demote_attr_group
when h->demote_order != 0, are created in hugetlb_register_node(). But
these sysfs groups are not removed when the node is unregistered, so they
are leaked. Use sysfs_remove_group() to fix this issue.
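
The create/remove pairing this fix restores (a simplified sketch with a
made-up group name, not the exact upstream code):

	/* registration */
	err = sysfs_create_group(kobj, &example_attr_group);
	...
	/* teardown must mirror registration */
	sysfs_remove_group(kobj, &example_attr_group);
	kobject_put(kobj);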

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b69d7808f457..e1356ad57087 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3850,12 +3850,18 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
 	}
 
 	if (h->demote_order) {
-		if (sysfs_create_group(hstate_kobjs[hi],
-					&hstate_demote_attr_group))
+		retval = sysfs_create_group(hstate_kobjs[hi],
+					    &hstate_demote_attr_group);
+		if (retval) {
 			pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
+			sysfs_remove_group(hstate_kobjs[hi], hstate_attr_group);
+			kobject_put(hstate_kobjs[hi]);
+			hstate_kobjs[hi] = NULL;
+			return retval;
+		}
 	}
 
-	return retval;
+	return 0;
 }
 
 static void __init hugetlb_sysfs_init(void)
@@ -3941,10 +3947,15 @@ static void hugetlb_unregister_node(struct node *node)
 
 	for_each_hstate(h) {
 		int idx = hstate_index(h);
-		if (nhs->hstate_kobjs[idx]) {
-			kobject_put(nhs->hstate_kobjs[idx]);
-			nhs->hstate_kobjs[idx] = NULL;
-		}
+		struct kobject *hstate_kobj = nhs->hstate_kobjs[idx];
+
+		if (!hstate_kobj)
+			continue;
+		if (h->demote_order)
+			sysfs_remove_group(hstate_kobj, &hstate_demote_attr_group);
+		sysfs_remove_group(hstate_kobj, &per_node_hstate_attr_group);
+		kobject_put(hstate_kobj);
+		nhs->hstate_kobjs[idx] = NULL;
 	}
 
 	kobject_put(nhs->hugepages_kobj);
-- 
2.23.0



* [PATCH 6/6] mm/hugetlb: make detecting shared pte more reliable
  2022-08-16 13:05 [PATCH 0/6] A few fixup patches for hugetlb Miaohe Lin
                   ` (4 preceding siblings ...)
  2022-08-16 13:05 ` [PATCH 5/6] mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node() Miaohe Lin
@ 2022-08-16 13:05 ` Miaohe Lin
  2022-08-17 23:56   ` Mike Kravetz
  5 siblings, 1 reply; 44+ messages in thread
From: Miaohe Lin @ 2022-08-16 13:05 UTC (permalink / raw)
  To: akpm, mike.kravetz, songmuchun; +Cc: linux-mm, linux-kernel, linmiaohe

If the pagetables are shared, we shouldn't copy or take references. Since
src could have unshared while dst shares with another vma, huge_pte_none()
is used to determine whether dst_pte is shared. But this check isn't
reliable: a shared pte can in fact be pte_none in the pagetable. Check the
page count of the ptep page instead to reliably determine whether the pte
is shared.
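
Background (simplified): huge_pmd_share() takes a reference on the page
table page for each vma that shares it, so page_count() of the ptep page
being greater than 1 indicates sharing even when individual entries
happen to be pte_none.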

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e1356ad57087..25db6d07479e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4795,15 +4795,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 
 		/*
 		 * If the pagetables are shared don't copy or take references.
-		 * dst_pte == src_pte is the common case of src/dest sharing.
 		 *
+		 * dst_pte == src_pte is the common case of src/dest sharing.
 		 * However, src could have 'unshared' and dst shares with
-		 * another vma.  If dst_pte !none, this implies sharing.
-		 * Check here before taking page table lock, and once again
-		 * after taking the lock below.
+		 * another vma. So page_count of ptep page is checked instead
+		 * to reliably determine whether pte is shared.
 		 */
-		dst_entry = huge_ptep_get(dst_pte);
-		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry)) {
+		if (page_count(virt_to_page(dst_pte)) > 1) {
 			addr |= last_addr_mask;
 			continue;
 		}
@@ -4814,11 +4812,9 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		entry = huge_ptep_get(src_pte);
 		dst_entry = huge_ptep_get(dst_pte);
 again:
-		if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
+		if (huge_pte_none(entry)) {
 			/*
-			 * Skip if src entry none.  Also, skip in the
-			 * unlikely case dst entry !none as this implies
-			 * sharing with another vma.
+			 * Skip if src entry none.
 			 */
 			;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
-- 
2.23.0



* Re: [PATCH 1/6] mm/hugetlb: fix incorrect update of max_huge_pages
  2022-08-16 13:05 ` [PATCH 1/6] mm/hugetlb: fix incorrect update of max_huge_pages Miaohe Lin
@ 2022-08-16 22:52   ` Mike Kravetz
  2022-08-16 23:20     ` Andrew Morton
  2022-08-17  2:28   ` Muchun Song
  1 sibling, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2022-08-16 22:52 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, songmuchun, linux-mm, linux-kernel

On 08/16/22 21:05, Miaohe Lin wrote:
> When a page is demoted, target_hstate->max_huge_pages should be
> incremented by pages_per_huge_page(h) / pages_per_huge_page(target_hstate)
> pages, not by pages_per_huge_page(h). Update max_huge_pages accordingly
> for consistency.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/hugetlb.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ea1c7bfa1cc3..e72052964fb5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3472,7 +3472,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
>  	 * based on pool changes for the demoted page.
>  	 */
>  	h->max_huge_pages--;
> -	target_hstate->max_huge_pages += pages_per_huge_page(h);
> +	target_hstate->max_huge_pages +=
> +		pages_per_huge_page(h) / pages_per_huge_page(target_hstate);

Thanks!

That is indeed incorrect.  However the miscalculation should not have any 
consequences.  Correct?  The value is used when initially populating the
pools.  It is never read and used again.  It is written to in
set_max_huge_pages if someone changes the number of hugetlb pages.

I guess that is a long way of saying I am not sure why we care about trying
to keep max_huge_pages up to date?  I do not think it matters.

I also thought, if we are going to adjust max_huge_pages here we may
also want to adjust the node specific value: h->max_huge_pages_node[node].
There are a few other places where the global max_huge_pages is adjusted
without adjusting the node specific value.

The more I think about it, the more I think we should explore just
eliminating any adjustment of this/these values after initially
populating the pools.
-- 
Mike Kravetz

>  
>  	return rc;
>  }
> -- 
> 2.23.0
> 


* Re: [PATCH 2/6] mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group()
  2022-08-16 13:05 ` [PATCH 2/6] mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group() Miaohe Lin
@ 2022-08-16 22:55   ` Mike Kravetz
  2022-08-17  2:31   ` Muchun Song
  1 sibling, 0 replies; 44+ messages in thread
From: Mike Kravetz @ 2022-08-16 22:55 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, songmuchun, linux-mm, linux-kernel

On 08/16/22 21:05, Miaohe Lin wrote:
> If sysfs_create_group() fails with hstate_attr_group, hstate_kobjs[hi]
> will be set to NULL. The NULL pointer will then be passed to
> sysfs_create_group() if h->demote_order != 0, triggering the
> WARN_ON(!kobj) check. Fix this by making sure hstate_kobjs[hi] != NULL
> when calling sysfs_create_group().
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/hugetlb.c | 1 +
>  1 file changed, 1 insertion(+)

Thanks!

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e72052964fb5..ff991e5bdf1f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3846,6 +3846,7 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
>  	if (retval) {
>  		kobject_put(hstate_kobjs[hi]);
>  		hstate_kobjs[hi] = NULL;
> +		return retval;
>  	}
>  
>  	if (h->demote_order) {
> -- 
> 2.23.0
> 


* Re: [PATCH 1/6] mm/hugetlb: fix incorrect update of max_huge_pages
  2022-08-16 22:52   ` Mike Kravetz
@ 2022-08-16 23:20     ` Andrew Morton
  2022-08-16 23:34       ` Mike Kravetz
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2022-08-16 23:20 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Miaohe Lin, songmuchun, linux-mm, linux-kernel

On Tue, 16 Aug 2022 15:52:47 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> On 08/16/22 21:05, Miaohe Lin wrote:
> > When a page is demoted, target_hstate->max_huge_pages should be
> > incremented by pages_per_huge_page(h) / pages_per_huge_page(target_hstate)
> > pages, not by pages_per_huge_page(h). Update max_huge_pages accordingly
> > for consistency.
> > 
> > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> > ---
> >  mm/hugetlb.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index ea1c7bfa1cc3..e72052964fb5 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -3472,7 +3472,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
> >  	 * based on pool changes for the demoted page.
> >  	 */
> >  	h->max_huge_pages--;
> > -	target_hstate->max_huge_pages += pages_per_huge_page(h);
> > +	target_hstate->max_huge_pages +=
> > +		pages_per_huge_page(h) / pages_per_huge_page(target_hstate);
> 
> Thanks!
> 
> That is indeed incorrect.  However the miscalculation should not have any 
> consequences.  Correct?  The value is used when initially populating the
> pools.  It is never read and used again.  It is written to in
> set_max_huge_pages if someone changes the number of hugetlb pages.
> 
> I guess that is a long way of saying I am not sure why we care about trying
> to keep max_huge_pages up to date?  I do not think it matters.
> 
> I also thought, if we are going to adjust max_huge_pages here we may
> also want to adjust the node specific value: h->max_huge_pages_node[node].
> There are a few other places where the global max_huge_pages is adjusted
> without adjusting the node specific value.
> 
> The more I think about it, the more I think we should explore just
> eliminating any adjustment of this/these values after initially
> populating the pools.

I'm thinking we should fix something that is "indeed incorrect" before
going on to more extensive things?



* Re: [PATCH 3/6] mm/hugetlb: fix missing call to restore_reserve_on_error()
  2022-08-16 13:05 ` [PATCH 3/6] mm/hugetlb: fix missing call to restore_reserve_on_error() Miaohe Lin
@ 2022-08-16 23:31   ` Mike Kravetz
  2022-08-17  1:59     ` Miaohe Lin
  0 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2022-08-16 23:31 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, songmuchun, linux-mm, linux-kernel

On 08/16/22 21:05, Miaohe Lin wrote:
> When huge_add_to_page_cache() fails, the page is freed directly without
> calling restore_reserve_on_error() to restore the reserve for newly
> allocated pages not in the page cache. Fix this by calling
> restore_reserve_on_error() when huge_add_to_page_cache() fails.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/hugetlb.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ff991e5bdf1f..b69d7808f457 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5603,6 +5603,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  		if (vma->vm_flags & VM_MAYSHARE) {
>  			int err = huge_add_to_page_cache(page, mapping, idx);
>  			if (err) {
> +				restore_reserve_on_error(h, vma, haddr, page);

Hmmmm.  I was going to comment that restore_reserve_on_error would not handle
the situation where 'err == -EEXIST' below.  This is because it implies we
raced with someone else that added the page to the cache.  And, that other
allocation, not this one, consumed the reservation.  However, I am not sure
how that could be possible?  The hugetlb fault mutex (which we hold)
must be held to add a page to the page cache.

Searching git history I see that code was added (or at least existed) before
the hugetlb fault mutex was introduced.  So, I believe that check for -EEXIST
and retry can go.

With that said, restore_reserve_on_error can be called here.  But, let's
look into removing that err == -EEXIST check to avoid confusion.
-- 
Mike Kravetz

>  				put_page(page);
>  				if (err == -EEXIST)
>  					goto retry;
> -- 
> 2.23.0
> 


* Re: [PATCH 1/6] mm/hugetlb: fix incorrect update of max_huge_pages
  2022-08-16 23:20     ` Andrew Morton
@ 2022-08-16 23:34       ` Mike Kravetz
  2022-08-17  1:53         ` Miaohe Lin
  0 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2022-08-16 23:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miaohe Lin, songmuchun, linux-mm, linux-kernel

On 08/16/22 16:20, Andrew Morton wrote:
> On Tue, 16 Aug 2022 15:52:47 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> > On 08/16/22 21:05, Miaohe Lin wrote:
> > > When a page is demoted, target_hstate->max_huge_pages should be
> > > incremented by pages_per_huge_page(h) / pages_per_huge_page(target_hstate)
> > > pages, not by pages_per_huge_page(h). Update max_huge_pages accordingly
> > > for consistency.
> > > 
> > > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> > > ---
> > >  mm/hugetlb.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index ea1c7bfa1cc3..e72052964fb5 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -3472,7 +3472,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
> > >  	 * based on pool changes for the demoted page.
> > >  	 */
> > >  	h->max_huge_pages--;
> > > -	target_hstate->max_huge_pages += pages_per_huge_page(h);
> > > +	target_hstate->max_huge_pages +=
> > > +		pages_per_huge_page(h) / pages_per_huge_page(target_hstate);
> > 
> > Thanks!
> > 
> > That is indeed incorrect.  However the miscalculation should not have any 
> > consequences.  Correct?  The value is used when initially populating the
> > pools.  It is never read and used again.  It is written to in
> > set_max_huge_pages if someone changes the number of hugetlb pages.
> > 
> > I guess that is a long way of saying I am not sure why we care about trying
> > to keep max_huge_pages up to date?  I do not think it matters.
> > 
> > I also thought, if we are going to adjust max_huge_pages here we may
> > also want to adjust the node specific value: h->max_huge_pages_node[node].
> > There are a few other places where the global max_huge_pages is adjusted
> > without adjusting the node specific value.
> > 
> > The more I think about it, the more I think we should explore just
> > eliminating any adjustment of this/these values after initially
> > populating the pools.
> 
> I'm thinking we should fix something that is "indeed incorrect" before
> going on to more extensive things?

Sure, I am good with that.

Just wanted to point out that the incorrect calculation does not have
any negative consequences.  Maybe prompting Miaohe to look into the more
extensive cleanup.
-- 
Mike Kravetz


* Re: [PATCH 1/6] mm/hugetlb: fix incorrect update of max_huge_pages
  2022-08-16 23:34       ` Mike Kravetz
@ 2022-08-17  1:53         ` Miaohe Lin
  0 siblings, 0 replies; 44+ messages in thread
From: Miaohe Lin @ 2022-08-17  1:53 UTC (permalink / raw)
  To: Mike Kravetz, Andrew Morton; +Cc: songmuchun, linux-mm, linux-kernel

On 2022/8/17 7:34, Mike Kravetz wrote:
> On 08/16/22 16:20, Andrew Morton wrote:
>> On Tue, 16 Aug 2022 15:52:47 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>>> On 08/16/22 21:05, Miaohe Lin wrote:
>>>> When a page is demoted, target_hstate->max_huge_pages should be
>>>> incremented by pages_per_huge_page(h) / pages_per_huge_page(target_hstate)
>>>> pages, not by pages_per_huge_page(h). Update max_huge_pages accordingly
>>>> for consistency.
>>>>
>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>> ---
>>>>  mm/hugetlb.c | 3 ++-
>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index ea1c7bfa1cc3..e72052964fb5 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -3472,7 +3472,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
>>>>  	 * based on pool changes for the demoted page.
>>>>  	 */
>>>>  	h->max_huge_pages--;
>>>> -	target_hstate->max_huge_pages += pages_per_huge_page(h);
>>>> +	target_hstate->max_huge_pages +=
>>>> +		pages_per_huge_page(h) / pages_per_huge_page(target_hstate);
>>>
>>> Thanks!
>>>
>>> That is indeed incorrect.  However the miscalculation should not have any 
>>> consequences.  Correct?  The value is used when initially populating the
>>> pools.  It is never read and used again.  It is written to in
>>> set_max_huge_pages if someone changes the number of hugetlb pages.
>>>
>>> I guess that is a long way of saying I am not sure why we care about trying
>>> to keep max_huge_pages up to date?  I do not think it matters.
>>>
>>> I also thought, if we are going to adjust max_huge_pages here we may
>>> also want to adjust the node specific value: h->max_huge_pages_node[node].
>>> There are a few other places where the global max_huge_pages is adjusted
>>> without adjusting the node specific value.
>>>
>>> The more I think about it, the more I think we should explore just
>>> eliminating any adjustment of this/these values after initially
>>> populating the pools.
>>
>> I'm thinking we should fix something that is "indeed incorrect" before
>> going on to more extensive things?
> 
> Sure, I am good with that.
> 
> Just wanted to point out that the incorrect calculation does not have
> any negative consequences.  Maybe prompting Miaohe to look into the more
> extensive cleanup.

Many thanks to you both. I will try to do this "more extensive cleanup" after the pending work is done.

Thanks,
Miaohe Lin




* Re: [PATCH 3/6] mm/hugetlb: fix missing call to restore_reserve_on_error()
  2022-08-16 23:31   ` Mike Kravetz
@ 2022-08-17  1:59     ` Miaohe Lin
  0 siblings, 0 replies; 44+ messages in thread
From: Miaohe Lin @ 2022-08-17  1:59 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: akpm, songmuchun, linux-mm, linux-kernel

On 2022/8/17 7:31, Mike Kravetz wrote:
> On 08/16/22 21:05, Miaohe Lin wrote:
>> When huge_add_to_page_cache() fails, the page is freed directly without
>> calling restore_reserve_on_error() to restore the reserve for newly
>> allocated pages not in the page cache. Fix this by calling
>> restore_reserve_on_error() when huge_add_to_page_cache() fails.
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/hugetlb.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index ff991e5bdf1f..b69d7808f457 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -5603,6 +5603,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>  		if (vma->vm_flags & VM_MAYSHARE) {
>>  			int err = huge_add_to_page_cache(page, mapping, idx);
>>  			if (err) {
>> +				restore_reserve_on_error(h, vma, haddr, page);
> 
> Hmmmm.  I was going to comment that restore_reserve_on_error would not handle
> the situation where 'err == -EEXIST' below.  This is because it implies we
> raced with someone else that added the page to the cache.  And, that other

Thanks for pointing this out.

> allocation, not this one, consumed the reservation.  However, I am not sure
> how that could be possible?  The hugetlb fault mutex (which we hold)
> must be held to add a page to the page cache.
> 
> Searching git history I see that code was added (or at least existed) before
> the hugetlb fault mutex was introduced.  So, I believe that check for -EEXIST
> and retry can go.

Agree with you. All call sites of huge_add_to_page_cache are protected by the hugetlb fault mutex.

> 
> With that said, restore_reserve_on_error can be called here.  But, let's
> look into removing that err == -EEXIST check to avoid confusion.

Will do it in next version. Many thanks for your review and comment.

Thanks,
Miaohe Lin



* Re: [PATCH 1/6] mm/hugetlb: fix incorrect update of max_huge_pages
  2022-08-16 13:05 ` [PATCH 1/6] mm/hugetlb: fix incorrect update of max_huge_pages Miaohe Lin
  2022-08-16 22:52   ` Mike Kravetz
@ 2022-08-17  2:28   ` Muchun Song
  1 sibling, 0 replies; 44+ messages in thread
From: Muchun Song @ 2022-08-17  2:28 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel



> On Aug 16, 2022, at 21:05, Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
> When a page is demoted, target_hstate->max_huge_pages should be
> incremented by pages_per_huge_page(h) / pages_per_huge_page(target_hstate)
> pages, not by pages_per_huge_page(h). Update max_huge_pages accordingly
> for consistency.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.



* Re: [PATCH 2/6] mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group()
  2022-08-16 13:05 ` [PATCH 2/6] mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group() Miaohe Lin
  2022-08-16 22:55   ` Mike Kravetz
@ 2022-08-17  2:31   ` Muchun Song
  2022-08-17  2:39     ` Miaohe Lin
  1 sibling, 1 reply; 44+ messages in thread
From: Muchun Song @ 2022-08-17  2:31 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Andrew Morton, mike.kravetz, Muchun Song, linux-mm, linux-kernel



> On Aug 16, 2022, at 21:05, Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
> If sysfs_create_group() fails with hstate_attr_group, hstate_kobjs[hi]
> will be set to NULL. The NULL pointer will then be passed to
> sysfs_create_group() if h->demote_order != 0, triggering the
> WARN_ON(!kobj) check. Fix this by making sure hstate_kobjs[hi] != NULL
> when calling sysfs_create_group().
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

It’s better to add a Fixes tag here.

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.



* Re: [PATCH 2/6] mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group()
  2022-08-17  2:31   ` Muchun Song
@ 2022-08-17  2:39     ` Miaohe Lin
  0 siblings, 0 replies; 44+ messages in thread
From: Miaohe Lin @ 2022-08-17  2:39 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, mike.kravetz, Muchun Song, linux-mm, linux-kernel

On 2022/8/17 10:31, Muchun Song wrote:
> 
> 
>> On Aug 16, 2022, at 21:05, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> If sysfs_create_group() fails with hstate_attr_group, hstate_kobjs[hi]
>> will be set to NULL. The NULL pointer will then be passed to
>> sysfs_create_group() if h->demote_order != 0, triggering the
>> WARN_ON(!kobj) check. Fix this by making sure hstate_kobjs[hi] != NULL
>> when calling sysfs_create_group().
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> 
> It’s better to add a Fixes tag here.

Will add it in next version. Thanks for your review and comment.

Thanks,
Miaohe Lin

> 
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> 
> Thanks.
> 
> .
> 



* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-16 13:05 ` [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at() Miaohe Lin
@ 2022-08-17  2:53   ` Muchun Song
  2022-08-17  8:41     ` Miaohe Lin
  2022-08-18  1:15   ` Yin, Fengwei
  2022-08-20  8:12   ` Muchun Song
  2 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2022-08-17  2:53 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel



> On Aug 16, 2022, at 21:05, Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
> The memory barrier smp_wmb() is needed to make sure that preceding stores
> to the page contents become visible before the below set_pte_at() write.

I’m not sure if you are right. I think it is set_pte_at()’s responsibility.
Take arm64 (since it is a Relaxed Memory Order model) as an example (the
following code snippet is set_pte()), I see a barrier guarantee. So I am
curious what issues you are facing. So I want to know the basis for you to
do this change.

 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
        *ptep = pte;

        /*
         * Only if the new pte is valid and kernel, otherwise TLB maintenance
         * or update_mmu_cache() have the necessary barriers.
         */
        if (pte_valid_not_user(pte)) {
               dsb(ishst);
               isb();
        }
 }

Thanks.

> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
> mm/hugetlb_vmemmap.c | 5 +++++
> 1 file changed, 5 insertions(+)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 20f414c0379f..76b2d03a0d8d 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -287,6 +287,11 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> 	copy_page(to, (void *)walk->reuse_addr);
> 	reset_struct_pages(to);
> 
> +	/*
> +	 * Makes sure that preceding stores to the page contents become visible
> +	 * before the set_pte_at() write.
> +	 */
> +	smp_wmb();
> 	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> }
> 
> -- 
> 2.23.0
> 
> 



* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-17  2:53   ` Muchun Song
@ 2022-08-17  8:41     ` Miaohe Lin
  2022-08-17  9:13       ` Yin, Fengwei
  2022-08-17 11:21       ` Muchun Song
  0 siblings, 2 replies; 44+ messages in thread
From: Miaohe Lin @ 2022-08-17  8:41 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel

On 2022/8/17 10:53, Muchun Song wrote:
> 
> 
>> On Aug 16, 2022, at 21:05, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> The memory barrier smp_wmb() is needed to make sure that preceding stores
>> to the page contents become visible before the below set_pte_at() write.
> 
> I’m not sure if you are right. I think it is set_pte_at()’s responsibility.

Maybe not. There are many call sites that do similar things:

hugetlb_mcopy_atomic_pte
__do_huge_pmd_anonymous_page
collapse_huge_page
do_anonymous_page
migrate_vma_insert_page
mcopy_atomic_pte

Take do_anonymous_page as an example:

	/*
	 * The memory barrier inside __SetPageUptodate makes sure that
	 * preceding stores to the page contents become visible before
	 * the set_pte_at() write.
	 */
	__SetPageUptodate(page);

So I think a memory barrier is needed before the set_pte_at() write. Or am I missing something?

Thanks,
Miaohe Lin

> Take arm64 (since it is a Relaxed Memory Order model) as an example (the
> following code snippet is set_pte()), I see a barrier guarantee. So I am
> curious what issues you are facing. So I want to know the basis for you to
> do this change.
> 
>  static inline void set_pte(pte_t *ptep, pte_t pte)
>  {
>         *ptep = pte;
> 
>         /*
>          * Only if the new pte is valid and kernel, otherwise TLB maintenance
>          * or update_mmu_cache() have the necessary barriers.
>          */
>         if (pte_valid_not_user(pte)) {
>                dsb(ishst);
>                isb();
>         }
>  }
> 
> Thanks.
> 


* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-17  8:41     ` Miaohe Lin
@ 2022-08-17  9:13       ` Yin, Fengwei
  2022-08-17 11:21       ` Muchun Song
  1 sibling, 0 replies; 44+ messages in thread
From: Yin, Fengwei @ 2022-08-17  9:13 UTC (permalink / raw)
  To: Miaohe Lin, Muchun Song
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel



On 8/17/2022 4:41 PM, Miaohe Lin wrote:
> So I think a memory barrier is needed before the set_pte_at() write. Or am I missing something?
Yes. I agree with you. The memory barrier should be put between the page
content change and the pte update. The patch looks good to me. Thanks.


Regards
Yin, Fengwei

> 
> Thanks,
> Miaohe Lin



* Re: [PATCH 5/6] mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node()
  2022-08-16 13:05 ` [PATCH 5/6] mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node() Miaohe Lin
@ 2022-08-17  9:41   ` Yin, Fengwei
  2022-08-18  1:00     ` Yin, Fengwei
  2022-08-18  1:12   ` Yin, Fengwei
  1 sibling, 1 reply; 44+ messages in thread
From: Yin, Fengwei @ 2022-08-17  9:41 UTC (permalink / raw)
  To: Miaohe Lin, akpm, mike.kravetz, songmuchun; +Cc: linux-mm, linux-kernel

Hi Miaohe,

On 8/16/2022 9:05 PM, Miaohe Lin wrote:
> }
>  
>  	if (h->demote_order) {
> -		if (sysfs_create_group(hstate_kobjs[hi],
> -					&hstate_demote_attr_group))
> +		retval = sysfs_create_group(hstate_kobjs[hi],
> +					    &hstate_demote_attr_group);
What about adding one more change:
   just return if creating hstate_attr_group failed:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0aee2f3ae15c..a67ef4b4eb3f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3845,6 +3845,7 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
        if (retval) {
                kobject_put(hstate_kobjs[hi]);
                hstate_kobjs[hi] = NULL;
+               return retval;
        }

Once hstate_kobjs[hi] is set to NULL, creating hstate_demote_attr_group will
fail as well. Thanks.


Regards
Yin, Fengwei

> +		if (retval) {
>  			pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
> +			sysfs_remove_group(hstate_kobjs[hi], hstate_attr_group);
> +			kobject_put(hstate_kobjs[hi]);
> +			hstate_kobjs[hi] = NULL;
> +			return retval;
> +		}
>  	}


* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-17  8:41     ` Miaohe Lin
  2022-08-17  9:13       ` Yin, Fengwei
@ 2022-08-17 11:21       ` Muchun Song
  2022-08-18  1:14         ` Yin, Fengwei
  1 sibling, 1 reply; 44+ messages in thread
From: Muchun Song @ 2022-08-17 11:21 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel



> On Aug 17, 2022, at 16:41, Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
> On 2022/8/17 10:53, Muchun Song wrote:
>> 
>> 
>>> On Aug 16, 2022, at 21:05, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>> 
>>> The memory barrier smp_wmb() is needed to make sure that preceding stores
>>> to the page contents become visible before the below set_pte_at() write.
>> 
>> I’m not sure if you are right. I think it is set_pte_at()’s responsibility.
> 
> Maybe not. There are many call sites that do similar things:
> 
> hugetlb_mcopy_atomic_pte
> __do_huge_pmd_anonymous_page
> collapse_huge_page
> do_anonymous_page
> migrate_vma_insert_page
> mcopy_atomic_pte
> 
> Take do_anonymous_page as an example:
> 
> 	/*
> 	 * The memory barrier inside __SetPageUptodate makes sure that
> 	 * preceding stores to the page contents become visible before
> 	 * the set_pte_at() write.
> 	 */
> 	__SetPageUptodate(page);

IIUC, in the case here we should make sure other CPUs can see the new
page’s contents after they have seen that PG_uptodate is set. I think
commit 0ed361dec369 can tell us more details.

I also looked at commit 52f37629fd3c to see why we need a barrier before
set_pte_at(), but I didn’t find any info explaining why. I guess we want
to guarantee the ordering between the page’s contents and subsequent
memory accesses using the corresponding virtual address; do you agree
with this?

Thanks.

> 
> So I think a memory barrier is needed before the set_pte_at() write. Or am I missing something?
> 
> Thanks,
> Miaohe Lin
> 
>> Take arm64 (since it is a Relaxed Memory Order model) as an example (the
>> following code snippet is set_pte()), I see a barrier guarantee. So I am
>> curious what issues you are facing. So I want to know the basis for you to
>> do this change.
>> 
>> static inline void set_pte(pte_t *ptep, pte_t pte)
>> {
>>        *ptep = pte;
>> 
>>        /*
>>         * Only if the new pte is valid and kernel, otherwise TLB maintenance
>>         * or update_mmu_cache() have the necessary barriers.
>>         */
>>        if (pte_valid_not_user(pte)) {
>>               dsb(ishst);
>>               isb();
>>        }
>> }
>> 
>> Thanks.
>> 



* Re: [PATCH 6/6] mm/hugetlb: make detecting shared pte more reliable
  2022-08-16 13:05 ` [PATCH 6/6] mm/hugetlb: make detecting shared pte more reliable Miaohe Lin
@ 2022-08-17 23:56   ` Mike Kravetz
  0 siblings, 0 replies; 44+ messages in thread
From: Mike Kravetz @ 2022-08-17 23:56 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, songmuchun, linux-mm, linux-kernel

On 08/16/22 21:05, Miaohe Lin wrote:
> If the pagetables are shared, we shouldn't copy or take references. Since
> src could have unshared while dst shares with another vma, huge_pte_none()
> is used to determine whether dst_pte is shared. But this check isn't
> reliable: a shared pte can in fact be pte_none in the pagetable. Check the
> page count of the ptep page instead to reliably determine whether the pte
> is shared.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/hugetlb.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)

You are correct, this is a better/more reliable way to check for pmd sharing.
It is accurate since we hold i_mmap_rwsem.  I like it.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

Note to self, this will not work if we move to vma based locking for pmd
sharing and do not hold i_mmap_rwsem here.

> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e1356ad57087..25db6d07479e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4795,15 +4795,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  
>  		/*
>  		 * If the pagetables are shared don't copy or take references.
> -		 * dst_pte == src_pte is the common case of src/dest sharing.
>  		 *
> +		 * dst_pte == src_pte is the common case of src/dest sharing.
>  		 * However, src could have 'unshared' and dst shares with
> -		 * another vma.  If dst_pte !none, this implies sharing.
> -		 * Check here before taking page table lock, and once again
> -		 * after taking the lock below.
> +		 * another vma. So page_count of ptep page is checked instead
> +		 * to reliably determine whether pte is shared.
>  		 */
> -		dst_entry = huge_ptep_get(dst_pte);
> -		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry)) {
> +		if (page_count(virt_to_page(dst_pte)) > 1) {
>  			addr |= last_addr_mask;
>  			continue;
>  		}
> @@ -4814,11 +4812,9 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  		entry = huge_ptep_get(src_pte);
>  		dst_entry = huge_ptep_get(dst_pte);
>  again:
> -		if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
> +		if (huge_pte_none(entry)) {
>  			/*
> -			 * Skip if src entry none.  Also, skip in the
> -			 * unlikely case dst entry !none as this implies
> -			 * sharing with another vma.
> +			 * Skip if src entry none.
>  			 */
>  			;
>  		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
> -- 
> 2.23.0
> 


* Re: [PATCH 5/6] mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node()
  2022-08-17  9:41   ` Yin, Fengwei
@ 2022-08-18  1:00     ` Yin, Fengwei
  0 siblings, 0 replies; 44+ messages in thread
From: Yin, Fengwei @ 2022-08-18  1:00 UTC (permalink / raw)
  To: Miaohe Lin, akpm, mike.kravetz, songmuchun; +Cc: linux-mm, linux-kernel


On 8/17/2022 5:41 PM, Yin, Fengwei wrote:
> Hi Miaohe,
> 
> On 8/16/2022 9:05 PM, Miaohe Lin wrote:
>> }
>>  
>>  	if (h->demote_order) {
>> -		if (sysfs_create_group(hstate_kobjs[hi],
>> -					&hstate_demote_attr_group))
>> +		retval = sysfs_create_group(hstate_kobjs[hi],
>> +					    &hstate_demote_attr_group);
> What about adding one more change:
>    just return if creating hstate_attr_group failed:
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0aee2f3ae15c..a67ef4b4eb3f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3845,6 +3845,7 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
>         if (retval) {
>                 kobject_put(hstate_kobjs[hi]);
>                 hstate_kobjs[hi] = NULL;
> +               return retval;
>         }
Please ignore this. I just saw that patch 2 made this change.

Regards
Yin, Fengwei

> 
> Once hstate_kobjs[hi] is set to NULL, creating hstate_demote_attr_group will
> fail as well. Thanks.
> 
> 
> Regards
> Yin, Fengwei
> 
>> +		if (retval) {
>>  			pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
>> +			sysfs_remove_group(hstate_kobjs[hi], hstate_attr_group);
>> +			kobject_put(hstate_kobjs[hi]);
>> +			hstate_kobjs[hi] = NULL;
>> +			return retval;
>> +		}
>>  	}
> 


* Re: [PATCH 5/6] mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node()
  2022-08-16 13:05 ` [PATCH 5/6] mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node() Miaohe Lin
  2022-08-17  9:41   ` Yin, Fengwei
@ 2022-08-18  1:12   ` Yin, Fengwei
  1 sibling, 0 replies; 44+ messages in thread
From: Yin, Fengwei @ 2022-08-18  1:12 UTC (permalink / raw)
  To: Miaohe Lin, akpm, mike.kravetz, songmuchun; +Cc: linux-mm, linux-kernel



On 8/16/2022 9:05 PM, Miaohe Lin wrote:
> The sysfs groups per_node_hstate_attr_group, and hstate_demote_attr_group
> when h->demote_order != 0, are created in hugetlb_register_node(). But
> these sysfs groups are not removed when the node is unregistered, so they
> are leaked. Use sysfs_remove_group() to fix this issue.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Fengwei Yin <fengwei.yin@intel.com>

Regards
Yin, Fengwei

> ---
>  mm/hugetlb.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b69d7808f457..e1356ad57087 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3850,12 +3850,18 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
>  	}
>  
>  	if (h->demote_order) {
> -		if (sysfs_create_group(hstate_kobjs[hi],
> -					&hstate_demote_attr_group))
> +		retval = sysfs_create_group(hstate_kobjs[hi],
> +					    &hstate_demote_attr_group);
> +		if (retval) {
>  			pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
> +			sysfs_remove_group(hstate_kobjs[hi], hstate_attr_group);
> +			kobject_put(hstate_kobjs[hi]);
> +			hstate_kobjs[hi] = NULL;
> +			return retval;
> +		}
>  	}
>  
> -	return retval;
> +	return 0;
>  }
>  
>  static void __init hugetlb_sysfs_init(void)
> @@ -3941,10 +3947,15 @@ static void hugetlb_unregister_node(struct node *node)
>  
>  	for_each_hstate(h) {
>  		int idx = hstate_index(h);
> -		if (nhs->hstate_kobjs[idx]) {
> -			kobject_put(nhs->hstate_kobjs[idx]);
> -			nhs->hstate_kobjs[idx] = NULL;
> -		}
> +		struct kobject *hstate_kobj = nhs->hstate_kobjs[idx];
> +
> +		if (!hstate_kobj)
> +			continue;
> +		if (h->demote_order)
> +			sysfs_remove_group(hstate_kobj, &hstate_demote_attr_group);
> +		sysfs_remove_group(hstate_kobj, &per_node_hstate_attr_group);
> +		kobject_put(hstate_kobj);
> +		nhs->hstate_kobjs[idx] = NULL;
>  	}
>  
>  	kobject_put(nhs->hugepages_kobj);


* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-17 11:21       ` Muchun Song
@ 2022-08-18  1:14         ` Yin, Fengwei
  2022-08-18  1:55           ` Miaohe Lin
  0 siblings, 1 reply; 44+ messages in thread
From: Yin, Fengwei @ 2022-08-18  1:14 UTC (permalink / raw)
  To: Muchun Song, Miaohe Lin
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel



On 8/17/2022 7:21 PM, Muchun Song wrote:
> 
> 
>> On Aug 17, 2022, at 16:41, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2022/8/17 10:53, Muchun Song wrote:
>>>
>>>
>>>> On Aug 16, 2022, at 21:05, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>
>>>> The memory barrier smp_wmb() is needed to make sure that preceding stores
>>>> to the page contents become visible before the below set_pte_at() write.
>>>
>>> I’m not sure if you are right. I think it is set_pte_at()’s responsibility.
>>
>> Maybe not. There are many call sites that do similar things:
>>
>> hugetlb_mcopy_atomic_pte
>> __do_huge_pmd_anonymous_page
>> collapse_huge_page
>> do_anonymous_page
>> migrate_vma_insert_page
>> mcopy_atomic_pte
>>
>> Take do_anonymous_page as an example:
>>
>> 	/*
>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>> 	 * preceding stores to the page contents become visible before
>> 	 * the set_pte_at() write.
>> 	 */
>> 	__SetPageUptodate(page);
> 
> IIUC, in the case here we should make sure other CPUs can see the new
> page’s contents after they have seen that PG_uptodate is set. I think
> commit 0ed361dec369 can tell us more details.
>
> I also looked at commit 52f37629fd3c to see why we need a barrier before
> set_pte_at(), but I didn’t find any info explaining why. I guess we want
> to guarantee the ordering between the page’s contents and subsequent
> memory accesses using the corresponding virtual address; do you agree
> with this?
This is my understanding also. Thanks.

Regards
Yin, Fengwei

> 
> Thanks.
> 
>>
>> So I think a memory barrier is needed before the set_pte_at() write. Or am I missing something?
>>
>> Thanks,
>> Miaohe Lin
>>
>>> Take arm64 (since it is a Relaxed Memory Order model) as an example (the
>>> following code snippet is set_pte()), I see a barrier guarantee. So I am
>>> curious what issues you are facing. So I want to know the basis for you to
>>> do this change.
>>>
>>> static inline void set_pte(pte_t *ptep, pte_t pte)
>>> {
>>>        *ptep = pte;
>>>
>>>        /*
>>>         * Only if the new pte is valid and kernel, otherwise TLB maintenance
>>>         * or update_mmu_cache() have the necessary barriers.
>>>         */
>>>        if (pte_valid_not_user(pte)) {
>>>               dsb(ishst);
>>>               isb();
>>>        }
>>> }
>>>
>>> Thanks.
>>>
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-16 13:05 ` [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at() Miaohe Lin
  2022-08-17  2:53   ` Muchun Song
@ 2022-08-18  1:15   ` Yin, Fengwei
  2022-08-20  8:12   ` Muchun Song
  2 siblings, 0 replies; 44+ messages in thread
From: Yin, Fengwei @ 2022-08-18  1:15 UTC (permalink / raw)
  To: Miaohe Lin, akpm, mike.kravetz, songmuchun; +Cc: linux-mm, linux-kernel



On 8/16/2022 9:05 PM, Miaohe Lin wrote:
> The memory barrier smp_wmb() is needed to make sure that preceding stores
> to the page contents become visible before the below set_pte_at() write.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>

Regards
Yin, Fengwei

> ---
>  mm/hugetlb_vmemmap.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 20f414c0379f..76b2d03a0d8d 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -287,6 +287,11 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>  	copy_page(to, (void *)walk->reuse_addr);
>  	reset_struct_pages(to);
>  
> +	/*
> +	 * Makes sure that preceding stores to the page contents become visible
> +	 * before the set_pte_at() write.
> +	 */
> +	smp_wmb();
>  	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>  }
>  


* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-18  1:14         ` Yin, Fengwei
@ 2022-08-18  1:55           ` Miaohe Lin
  2022-08-18  2:00             ` Yin, Fengwei
  0 siblings, 1 reply; 44+ messages in thread
From: Miaohe Lin @ 2022-08-18  1:55 UTC (permalink / raw)
  To: Yin, Fengwei, Muchun Song
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel

On 2022/8/18 9:14, Yin, Fengwei wrote:
> 
> 
> On 8/17/2022 7:21 PM, Muchun Song wrote:
>>
>>
>>> On Aug 17, 2022, at 16:41, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>
>>> On 2022/8/17 10:53, Muchun Song wrote:
>>>>
>>>>
>>>>> On Aug 16, 2022, at 21:05, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>>
>>>>> The memory barrier smp_wmb() is needed to make sure that preceding stores
>>>>> to the page contents become visible before the below set_pte_at() write.
>>>>
>>>> I’m not sure if you are right. I think it is set_pte_at()’s responsibility.
>>>
>>> Maybe not. There are many call sites that do similar things:
>>>
>>> hugetlb_mcopy_atomic_pte
>>> __do_huge_pmd_anonymous_page
>>> collapse_huge_page
>>> do_anonymous_page
>>> migrate_vma_insert_page
>>> mcopy_atomic_pte
>>>
>>> Take do_anonymous_page as an example:
>>>
>>> 	/*
>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>> 	 * preceding stores to the page contents become visible before
>>> 	 * the set_pte_at() write.
>>> 	 */
>>> 	__SetPageUptodate(page);
>>
>> IIUC, in the case here we should make sure other CPUs can see the new
>> page’s contents after they have seen that PG_uptodate is set. I think
>> commit 0ed361dec369 can tell us more details.
>>
>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>> to guarantee the ordering between the page’s contents and subsequent
>> memory accesses using the corresponding virtual address; do you agree
>> with this?
> This is my understanding also. Thanks.

That's also my understanding. Thanks both.

Thanks,
Miaohe Lin



* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-18  1:55           ` Miaohe Lin
@ 2022-08-18  2:00             ` Yin, Fengwei
  2022-08-18  2:47               ` Muchun Song
  0 siblings, 1 reply; 44+ messages in thread
From: Yin, Fengwei @ 2022-08-18  2:00 UTC (permalink / raw)
  To: Miaohe Lin, Yin, Fengwei, Muchun Song
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel



On 8/18/2022 9:55 AM, Miaohe Lin wrote:
>>>> 	/*
>>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>>> 	 * preceding stores to the page contents become visible before
>>>> 	 * the set_pte_at() write.
>>>> 	 */
>>>> 	__SetPageUptodate(page);
>>> IIUC, in the case here we should make sure other CPUs can see the new
>>> page’s contents after they have seen that PG_uptodate is set. I think
>>> commit 0ed361dec369 can tell us more details.
>>>
>>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>>> to guarantee the ordering between the page’s contents and subsequent
>>> memory accesses using the corresponding virtual address; do you agree
>>> with this?
>> This is my understanding also. Thanks.
> That's also my understanding. Thanks both.
I have an unclear thing (not directly related to this patch): who is
responsible for the read barrier on the read side in this case?

For SetPageUptodate, there are pairing write/read memory barriers.


Regards
Yin, Fengwei



* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-18  2:00             ` Yin, Fengwei
@ 2022-08-18  2:47               ` Muchun Song
  2022-08-18  7:52                 ` Miaohe Lin
  0 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2022-08-18  2:47 UTC (permalink / raw)
  To: Yin, Fengwei
  Cc: Miaohe Lin, Andrew Morton, Mike Kravetz, Muchun Song, Linux MM,
	linux-kernel



> On Aug 18, 2022, at 10:00, Yin, Fengwei <fengwei.yin@intel.com> wrote:
> 
> 
> 
> On 8/18/2022 9:55 AM, Miaohe Lin wrote:
>>>>> 	/*
>>>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>>>> 	 * preceding stores to the page contents become visible before
>>>>> 	 * the set_pte_at() write.
>>>>> 	 */
>>>>> 	__SetPageUptodate(page);
>>>> IIUC, in the case here we should make sure other CPUs can see the new
>>>> page’s contents after they have seen that PG_uptodate is set. I think
>>>> commit 0ed361dec369 can tell us more details.
>>>>
>>>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>>>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>>>> to guarantee the ordering between the page’s contents and subsequent
>>>> memory accesses using the corresponding virtual address; do you agree
>>>> with this?
>>> This is my understanding also. Thanks.
>> That's also my understanding. Thanks both.
> I have an unclear thing (not related with this patch directly): Who is response
> for the read barrier in the read side in this case?
> 
> For SetPageUptodate, there are paring write/read memory barrier.
> 

I have the same question. So I think the example proposed by Miaohe is a little
difference from the case (hugetlb_vmemmap) here.

> 
> Regards
> Yin, Fengwei
> 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-18  2:47               ` Muchun Song
@ 2022-08-18  7:52                 ` Miaohe Lin
  2022-08-18  7:59                   ` Muchun Song
  0 siblings, 1 reply; 44+ messages in thread
From: Miaohe Lin @ 2022-08-18  7:52 UTC (permalink / raw)
  To: Muchun Song, Yin, Fengwei
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel

On 2022/8/18 10:47, Muchun Song wrote:
> 
> 
>> On Aug 18, 2022, at 10:00, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 8/18/2022 9:55 AM, Miaohe Lin wrote:
>>>>>> 	/*
>>>>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>>>>> 	 * preceding stores to the page contents become visible before
>>>>>> 	 * the set_pte_at() write.
>>>>>> 	 */
>>>>>> 	__SetPageUptodate(page);
>>>>> IIUC, in this case we should make sure other CPUs can see the new page’s
>>>>> contents after they have seen PG_uptodate set. I think commit 0ed361dec369
>>>>> can tell us more details.
>>>>>
>>>>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>>>>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>>>>> to ensure the ordering between the page’s contents and subsequent memory
>>>>> accesses using the corresponding virtual address. Do you agree with this?
>>>> This is my understanding also. Thanks.
>>> That's also my understanding. Thanks both.
>> I have an unclear thing (not directly related to this patch): who is responsible
>> for the read barrier on the read side in this case?
>>
>> For SetPageUptodate, there are pairing write/read memory barriers.
>>
>
> I have the same question. So I think the example proposed by Miaohe is a little
> different from the case (hugetlb_vmemmap) here.

Per my understanding, the memory barrier in PageUptodate() is needed because a user might access the
page contents using page_address() (the corresponding pagetable entry already exists) soon. But for
the above proposed case, if a user wants to access the page contents, the corresponding pagetable entry
must be visible first or the page contents can't be accessed. So there should be a data dependency
acting as a memory barrier between loading the pagetable entry and accessing the page contents.
Or am I missing something?

Thanks,
Miaohe Lin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-18  7:52                 ` Miaohe Lin
@ 2022-08-18  7:59                   ` Muchun Song
  2022-08-18  8:32                     ` Yin, Fengwei
  0 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2022-08-18  7:59 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Yin, Fengwei, Andrew Morton, Mike Kravetz, Muchun Song, Linux MM,
	linux-kernel



> On Aug 18, 2022, at 15:52, Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
> On 2022/8/18 10:47, Muchun Song wrote:
>> 
>> 
>>> On Aug 18, 2022, at 10:00, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>> 
>>> 
>>> 
>>> On 8/18/2022 9:55 AM, Miaohe Lin wrote:
>>>>>>> 	/*
>>>>>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>>>>>> 	 * preceding stores to the page contents become visible before
>>>>>>> 	 * the set_pte_at() write.
>>>>>>> 	 */
>>>>>>> 	__SetPageUptodate(page);
>>>>>> IIUC, in this case we should make sure other CPUs can see the new page’s
>>>>>> contents after they have seen PG_uptodate set. I think commit 0ed361dec369
>>>>>> can tell us more details.
>>>>>>
>>>>>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>>>>>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>>>>>> to ensure the ordering between the page’s contents and subsequent memory
>>>>>> accesses using the corresponding virtual address. Do you agree with this?
>>>>> This is my understanding also. Thanks.
>>>> That's also my understanding. Thanks both.
>>> I have an unclear thing (not directly related to this patch): who is responsible
>>> for the read barrier on the read side in this case?
>>> 
>>> For SetPageUptodate, there are pairing write/read memory barriers.
>>> 
>> 
>> I have the same question. So I think the example proposed by Miaohe is a little
>> different from the case (hugetlb_vmemmap) here.
> 
> Per my understanding, the memory barrier in PageUptodate() is needed because a user might access the
> page contents using page_address() (the corresponding pagetable entry already exists) soon. But for
> the above proposed case, if a user wants to access the page contents, the corresponding pagetable entry
> must be visible first or the page contents can't be accessed. So there should be a data dependency
> acting as a memory barrier between loading the pagetable entry and accessing the page contents.
> Or am I missing something?

Yep, it is a data dependency. The difference between hugetlb_vmemmap and PageUptodate() is that
the page table (a pointer to the mapped page frame) is loaded by the MMU while PageUptodate() is
loaded by the CPU. It seems the data dependency should be inserted between the MMU access and the CPU
access. Maybe it is the hardware’s guarantee?

> 
> Thanks,
> Miaohe Lin


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-18  7:59                   ` Muchun Song
@ 2022-08-18  8:32                     ` Yin, Fengwei
  2022-08-18  8:40                       ` Muchun Song
  0 siblings, 1 reply; 44+ messages in thread
From: Yin, Fengwei @ 2022-08-18  8:32 UTC (permalink / raw)
  To: Muchun Song, Miaohe Lin
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel



On 8/18/2022 3:59 PM, Muchun Song wrote:
> 
> 
>> On Aug 18, 2022, at 15:52, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2022/8/18 10:47, Muchun Song wrote:
>>>
>>>
>>>> On Aug 18, 2022, at 10:00, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/18/2022 9:55 AM, Miaohe Lin wrote:
>>>>>>>> 	/*
>>>>>>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>>>>>>> 	 * preceding stores to the page contents become visible before
>>>>>>>> 	 * the set_pte_at() write.
>>>>>>>> 	 */
>>>>>>>> 	__SetPageUptodate(page);
>>>>>>> IIUC, in this case we should make sure other CPUs can see the new page’s
>>>>>>> contents after they have seen PG_uptodate set. I think commit 0ed361dec369
>>>>>>> can tell us more details.
>>>>>>>
>>>>>>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>>>>>>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>>>>>>> to ensure the ordering between the page’s contents and subsequent memory
>>>>>>> accesses using the corresponding virtual address. Do you agree with this?
>>>>>> This is my understanding also. Thanks.
>>>>> That's also my understanding. Thanks both.
>>>> I have an unclear thing (not directly related to this patch): who is responsible
>>>> for the read barrier on the read side in this case?
>>>>
>>>> For SetPageUptodate, there are pairing write/read memory barriers.
>>>>
>>>
>>> I have the same question. So I think the example proposed by Miaohe is a little
>>> different from the case (hugetlb_vmemmap) here.
>>
>> Per my understanding, the memory barrier in PageUptodate() is needed because a user might access the
>> page contents using page_address() (the corresponding pagetable entry already exists) soon. But for
>> the above proposed case, if a user wants to access the page contents, the corresponding pagetable entry
>> must be visible first or the page contents can't be accessed. So there should be a data dependency
>> acting as a memory barrier between loading the pagetable entry and accessing the page contents.
>> Or am I missing something?
> 
> Yep, it is a data dependency. The difference between hugetlb_vmemmap and PageUptodate() is that
> the page table (a pointer to the mapped page frame) is loaded by the MMU while PageUptodate() is
> loaded by the CPU. It seems the data dependency should be inserted between the MMU access and the CPU
> access. Maybe it is the hardware’s guarantee?
I just found that the comment in pmd_install() explains why most arches have no read
side memory barrier, except alpha, which does have one.
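
For reference, the relevant part of pmd_install() is roughly the following
(paraphrased from mm/memory.c, not an exact quote):

	if (likely(pmd_none(*pmd))) {	/* Has another populated it? */
		mm_inc_nr_ptes(mm);
		/*
		 * Ensure all pte setup is visible before the pte is made
		 * visible to other CPUs by being put into page tables.
		 * The read side is a chain of data-dependent loads, which
		 * all CPUs except alpha already keep in order; alpha's page
		 * table accessors supply the missing smp_rmb().
		 */
		smp_wmb();
		pmd_populate(mm, pmd, *pte);
		*pte = NULL;
	}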


Regards
Yin, Fengwei

> 
>>
>> Thanks,
>> Miaohe Lin
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-18  8:32                     ` Yin, Fengwei
@ 2022-08-18  8:40                       ` Muchun Song
  2022-08-18  8:54                         ` Yin, Fengwei
  0 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2022-08-18  8:40 UTC (permalink / raw)
  To: Yin, Fengwei
  Cc: Miaohe Lin, Andrew Morton, Mike Kravetz, Muchun Song, Linux MM,
	linux-kernel



> On Aug 18, 2022, at 16:32, Yin, Fengwei <fengwei.yin@intel.com> wrote:
> 
> 
> 
> On 8/18/2022 3:59 PM, Muchun Song wrote:
>> 
>> 
>>> On Aug 18, 2022, at 15:52, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>> 
>>> On 2022/8/18 10:47, Muchun Song wrote:
>>>> 
>>>> 
>>>>> On Aug 18, 2022, at 10:00, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On 8/18/2022 9:55 AM, Miaohe Lin wrote:
>>>>>>>>> 	/*
>>>>>>>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>>>>>>>> 	 * preceding stores to the page contents become visible before
>>>>>>>>> 	 * the set_pte_at() write.
>>>>>>>>> 	 */
>>>>>>>>> 	__SetPageUptodate(page);
>>>>>>>> IIUC, in this case we should make sure other CPUs can see the new page’s
>>>>>>>> contents after they have seen PG_uptodate set. I think commit 0ed361dec369
>>>>>>>> can tell us more details.
>>>>>>>>
>>>>>>>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>>>>>>>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>>>>>>>> to ensure the ordering between the page’s contents and subsequent memory
>>>>>>>> accesses using the corresponding virtual address. Do you agree with this?
>>>>>>> This is my understanding also. Thanks.
>>>>>> That's also my understanding. Thanks both.
>>>>> I have an unclear thing (not directly related to this patch): who is responsible
>>>>> for the read barrier on the read side in this case?
>>>>> 
>>>>> For SetPageUptodate, there are pairing write/read memory barriers.
>>>>> 
>>>>
>>>> I have the same question. So I think the example proposed by Miaohe is a little
>>>> different from the case (hugetlb_vmemmap) here.
>>>
>>> Per my understanding, the memory barrier in PageUptodate() is needed because a user might access the
>>> page contents using page_address() (the corresponding pagetable entry already exists) soon. But for
>>> the above proposed case, if a user wants to access the page contents, the corresponding pagetable entry
>>> must be visible first or the page contents can't be accessed. So there should be a data dependency
>>> acting as a memory barrier between loading the pagetable entry and accessing the page contents.
>>> Or am I missing something?
>>
>> Yep, it is a data dependency. The difference between hugetlb_vmemmap and PageUptodate() is that
>> the page table (a pointer to the mapped page frame) is loaded by the MMU while PageUptodate() is
>> loaded by the CPU. It seems the data dependency should be inserted between the MMU access and the CPU
>> access. Maybe it is the hardware’s guarantee?
> I just found that the comment in pmd_install() explains why most arches have no read

I think pmd_install() is a little different as well. We should make sure
page table walkers (like GUP) see the correct PTE entry after they see the pmd
entry.

> side memory barrier, except alpha, which does have one.

Right. Only alpha has a data dependency barrier.

> 
> 
> Regards
> Yin, Fengwei
> 
>> 
>>> 
>>> Thanks,
>>> Miaohe Lin


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-18  8:40                       ` Muchun Song
@ 2022-08-18  8:54                         ` Yin, Fengwei
  2022-08-18  9:18                           ` Muchun Song
  0 siblings, 1 reply; 44+ messages in thread
From: Yin, Fengwei @ 2022-08-18  8:54 UTC (permalink / raw)
  To: Muchun Song
  Cc: Miaohe Lin, Andrew Morton, Mike Kravetz, Muchun Song, Linux MM,
	linux-kernel



On 8/18/2022 4:40 PM, Muchun Song wrote:
> 
> 
>> On Aug 18, 2022, at 16:32, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 8/18/2022 3:59 PM, Muchun Song wrote:
>>>
>>>
>>>> On Aug 18, 2022, at 15:52, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>
>>>> On 2022/8/18 10:47, Muchun Song wrote:
>>>>>
>>>>>
>>>>>> On Aug 18, 2022, at 10:00, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8/18/2022 9:55 AM, Miaohe Lin wrote:
>>>>>>>>>> 	/*
>>>>>>>>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>>>>>>>>> 	 * preceding stores to the page contents become visible before
>>>>>>>>>> 	 * the set_pte_at() write.
>>>>>>>>>> 	 */
>>>>>>>>>> 	__SetPageUptodate(page);
>>>>>>>>> IIUC, in this case we should make sure other CPUs can see the new page’s
>>>>>>>>> contents after they have seen PG_uptodate set. I think commit 0ed361dec369
>>>>>>>>> can tell us more details.
>>>>>>>>>
>>>>>>>>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>>>>>>>>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>>>>>>>>> to ensure the ordering between the page’s contents and subsequent memory
>>>>>>>>> accesses using the corresponding virtual address. Do you agree with this?
>>>>>>>> This is my understanding also. Thanks.
>>>>>>> That's also my understanding. Thanks both.
>>>>>> I have an unclear thing (not directly related to this patch): who is responsible
>>>>>> for the read barrier on the read side in this case?
>>>>>>
>>>>>> For SetPageUptodate, there are pairing write/read memory barriers.
>>>>>>
>>>>>
>>>>> I have the same question. So I think the example proposed by Miaohe is a little
>>>>> different from the case (hugetlb_vmemmap) here.
>>>>
>>>> Per my understanding, the memory barrier in PageUptodate() is needed because a user might access the
>>>> page contents using page_address() (the corresponding pagetable entry already exists) soon. But for
>>>> the above proposed case, if a user wants to access the page contents, the corresponding pagetable entry
>>>> must be visible first or the page contents can't be accessed. So there should be a data dependency
>>>> acting as a memory barrier between loading the pagetable entry and accessing the page contents.
>>>> Or am I missing something?
>>>
>>> Yep, it is a data dependency. The difference between hugetlb_vmemmap and PageUptodate() is that
>>> the page table (a pointer to the mapped page frame) is loaded by the MMU while PageUptodate() is
>>> loaded by the CPU. It seems the data dependency should be inserted between the MMU access and the CPU
>>> access. Maybe it is the hardware’s guarantee?
>> I just found that the comment in pmd_install() explains why most arches have no read
> 
> I think pmd_install() is a little different as well. We should make sure
> page table walkers (like GUP) see the correct PTE entry after they see the pmd
> entry.

The difference I can see is that the pmd/pte case has both a hardware page walker and
a software page walker (like GUP) on the read side, while the case here only has the hardware
page walker on the read side. But I suppose the memory barrier requirement still applies
here.

Maybe we could do a test: add a large delay between reset_struct_pages() and set_pte_at()?
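
A sketch of that test, purely illustrative (the mdelay() is a testing aid and
nothing for merging):

	/* In vmemmap_restore_pte(), for the experiment only: */
	copy_page(to, (void *)walk->reuse_addr);
	reset_struct_pages(to);
	mdelay(100);	/* widen the window before the PTE is published */
	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));

Note that a delay here only separates the stores in time; it does not force
the CPU to reorder them, which is why reordering the code manually, as
discussed below, may be the easier construction.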

Regards
Yin, Fengwei 

> 
>> side memory barrier, except alpha, which does have one.
> 
> Right. Only alpha has a data dependency barrier.
> 
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>> Thanks,
>>>> Miaohe Lin
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-18  8:54                         ` Yin, Fengwei
@ 2022-08-18  9:18                           ` Muchun Song
  2022-08-18 12:58                             ` Miaohe Lin
  0 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2022-08-18  9:18 UTC (permalink / raw)
  To: Yin, Fengwei, Miaohe Lin
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel



> On Aug 18, 2022, at 16:54, Yin, Fengwei <fengwei.yin@intel.com> wrote:
> 
> 
> 
> On 8/18/2022 4:40 PM, Muchun Song wrote:
>> 
>> 
>>> On Aug 18, 2022, at 16:32, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>> 
>>> 
>>> 
>>> On 8/18/2022 3:59 PM, Muchun Song wrote:
>>>> 
>>>> 
>>>>> On Aug 18, 2022, at 15:52, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>> 
>>>>> On 2022/8/18 10:47, Muchun Song wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Aug 18, 2022, at 10:00, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 8/18/2022 9:55 AM, Miaohe Lin wrote:
>>>>>>>>>>> 	/*
>>>>>>>>>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>>>>>>>>>> 	 * preceding stores to the page contents become visible before
>>>>>>>>>>> 	 * the set_pte_at() write.
>>>>>>>>>>> 	 */
>>>>>>>>>>> 	__SetPageUptodate(page);
>>>>>>>>>> IIUC, in this case we should make sure other CPUs can see the new page’s
>>>>>>>>>> contents after they have seen PG_uptodate set. I think commit 0ed361dec369
>>>>>>>>>> can tell us more details.
>>>>>>>>>>
>>>>>>>>>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>>>>>>>>>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>>>>>>>>>> to ensure the ordering between the page’s contents and subsequent memory
>>>>>>>>>> accesses using the corresponding virtual address. Do you agree with this?
>>>>>>>>> This is my understanding also. Thanks.
>>>>>>>> That's also my understanding. Thanks both.
>>>>>>> I have an unclear thing (not directly related to this patch): who is responsible
>>>>>>> for the read barrier on the read side in this case?
>>>>>>> 
>>>>>>> For SetPageUptodate, there are pairing write/read memory barriers.
>>>>>>> 
>>>>>>
>>>>>> I have the same question. So I think the example proposed by Miaohe is a little
>>>>>> different from the case (hugetlb_vmemmap) here.
>>>>>
>>>>> Per my understanding, the memory barrier in PageUptodate() is needed because a user might access the
>>>>> page contents using page_address() (the corresponding pagetable entry already exists) soon. But for
>>>>> the above proposed case, if a user wants to access the page contents, the corresponding pagetable entry
>>>>> must be visible first or the page contents can't be accessed. So there should be a data dependency
>>>>> acting as a memory barrier between loading the pagetable entry and accessing the page contents.
>>>>> Or am I missing something?
>>>> 
>>>> Yep, it is a data dependency. The difference between hugetlb_vmemmap and PageUptodate() is that
>>>> the page table (a pointer to the mapped page frame) is loaded by the MMU while PageUptodate() is
>>>> loaded by the CPU. It seems the data dependency should be inserted between the MMU access and the CPU
>>>> access. Maybe it is the hardware’s guarantee?
>>> I just found that the comment in pmd_install() explains why most arches have no read
>> 
>> I think pmd_install() is a little different as well. We should make sure
>> page table walkers (like GUP) see the correct PTE entry after they see the pmd
>> entry.
> 
> The difference I can see is that the pmd/pte case has both a hardware page walker and
> a software page walker (like GUP) on the read side, while the case here only has the hardware
> page walker on the read side. But I suppose the memory barrier requirement still applies
> here.

I am not against this change; I just want to get a better understanding of the
hardware behavior.

> 
> Maybe we could do a test: add a large delay between reset_struct_pages() and set_pte_at()?

Hi Miaohe,

Would you mind doing this test? One thread does vmemmap_restore_pte(), and another thread
detects whether it can see a tail page with PG_head after the previous thread has executed
set_pte_at().

Thanks.

> 
> Regards
> Yin, Fengwei 
> 
>> 
>>> side memory barrier, except alpha, which does have one.
>> 
>> Right. Only alpha has a data dependency barrier.
>> 
>>> 
>>> 
>>> Regards
>>> Yin, Fengwei
>>> 
>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Miaohe Lin


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-18  9:18                           ` Muchun Song
@ 2022-08-18 12:58                             ` Miaohe Lin
  2022-08-18 23:53                               ` Yin, Fengwei
  2022-08-19  3:19                               ` Muchun Song
  0 siblings, 2 replies; 44+ messages in thread
From: Miaohe Lin @ 2022-08-18 12:58 UTC (permalink / raw)
  To: Muchun Song, Yin, Fengwei
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel

On 2022/8/18 17:18, Muchun Song wrote:
> 
> 
>> On Aug 18, 2022, at 16:54, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 8/18/2022 4:40 PM, Muchun Song wrote:
>>>
>>>
>>>> On Aug 18, 2022, at 16:32, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/18/2022 3:59 PM, Muchun Song wrote:
>>>>>
>>>>>
>>>>>> On Aug 18, 2022, at 15:52, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>>>
>>>>>> On 2022/8/18 10:47, Muchun Song wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Aug 18, 2022, at 10:00, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/18/2022 9:55 AM, Miaohe Lin wrote:
>>>>>>>>>>>> 	/*
>>>>>>>>>>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>>>>>>>>>>> 	 * preceding stores to the page contents become visible before
>>>>>>>>>>>> 	 * the set_pte_at() write.
>>>>>>>>>>>> 	 */
>>>>>>>>>>>> 	__SetPageUptodate(page);
>>>>>>>>>>> IIUC, in this case we should make sure other CPUs can see the new page’s
>>>>>>>>>>> contents after they have seen PG_uptodate set. I think commit 0ed361dec369
>>>>>>>>>>> can tell us more details.
>>>>>>>>>>>
>>>>>>>>>>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>>>>>>>>>>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>>>>>>>>>>> to ensure the ordering between the page’s contents and subsequent memory
>>>>>>>>>>> accesses using the corresponding virtual address. Do you agree with this?
>>>>>>>>>> This is my understanding also. Thanks.
>>>>>>>>> That's also my understanding. Thanks both.
>>>>>>>> I have an unclear thing (not directly related to this patch): who is responsible
>>>>>>>> for the read barrier on the read side in this case?
>>>>>>>>
>>>>>>>> For SetPageUptodate, there are pairing write/read memory barriers.
>>>>>>>>
>>>>>>>
>>>>>>> I have the same question. So I think the example proposed by Miaohe is a little
>>>>>>> different from the case (hugetlb_vmemmap) here.
>>>>>>
>>>>>> Per my understanding, the memory barrier in PageUptodate() is needed because a user might access the
>>>>>> page contents using page_address() (the corresponding pagetable entry already exists) soon. But for
>>>>>> the above proposed case, if a user wants to access the page contents, the corresponding pagetable entry
>>>>>> must be visible first or the page contents can't be accessed. So there should be a data dependency
>>>>>> acting as a memory barrier between loading the pagetable entry and accessing the page contents.
>>>>>> Or am I missing something?
>>>>>
>>>>> Yep, it is a data dependency. The difference between hugetlb_vmemmap and PageUptodate() is that
>>>>> the page table (a pointer to the mapped page frame) is loaded by the MMU while PageUptodate() is
>>>>> loaded by the CPU. It seems the data dependency should be inserted between the MMU access and the CPU
>>>>> access. Maybe it is the hardware’s guarantee?
>>>> I just found that the comment in pmd_install() explains why most arches have no read
>>>
>>> I think pmd_install() is a little different as well. We should make sure
>>> page table walkers (like GUP) see the correct PTE entry after they see the pmd
>>> entry.
>>
>> The difference I can see is that the pmd/pte case has both a hardware page walker and
>> a software page walker (like GUP) on the read side, while the case here only has the hardware
>> page walker on the read side. But I suppose the memory barrier requirement still applies
>> here.
> 
> I am not against this change; I just want to get a better understanding of the
> hardware behavior.
> 
>>
>> Maybe we could do a test: add a large delay between reset_struct_pages() and set_pte_at()?
> 
> Hi Miaohe,
> 
> Would you mind doing this test? One thread does vmemmap_restore_pte(), and another thread
> detects whether it can see a tail page with PG_head after the previous thread has executed
> set_pte_at().

Would it be easier to construct the memory reordering manually, like below?

vmemmap_restore_pte()
	...
	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
	/* maybe add a delay here. */
	copy_page(to, (void *)walk->reuse_addr);
	reset_struct_pages(to);

And another thread detects whether it can see a tail page with some invalid fields? If so,
it seems the problem will always trigger. If not, we are depending on the observed memory reordering,
and on set_pte_at() not containing a memory barrier?
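
A sketch of such a detecting thread (hypothetical helper, illustrative only):

	/* Watches one tail page of the huge page being remapped. */
	static int vmemmap_watch(void *arg)
	{
		struct page *tail = (struct page *)arg;	/* e.g. head + 1 */

		while (!kthread_should_stop()) {
			if (PageHead(tail))	/* stale contents observed? */
				pr_err("tail page still has PG_head\n");
			cpu_relax();
		}
		return 0;
	}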

Thanks,
Miaohe Lin


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-18 12:58                             ` Miaohe Lin
@ 2022-08-18 23:53                               ` Yin, Fengwei
  2022-08-19  3:19                               ` Muchun Song
  1 sibling, 0 replies; 44+ messages in thread
From: Yin, Fengwei @ 2022-08-18 23:53 UTC (permalink / raw)
  To: Miaohe Lin, Muchun Song
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel



On 8/18/2022 8:58 PM, Miaohe Lin wrote:
> On 2022/8/18 17:18, Muchun Song wrote:
>>
>>
>>> On Aug 18, 2022, at 16:54, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>
>>>
>>>
>>> On 8/18/2022 4:40 PM, Muchun Song wrote:
>>>>
>>>>
>>>>> On Aug 18, 2022, at 16:32, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 8/18/2022 3:59 PM, Muchun Song wrote:
>>>>>>
>>>>>>
>>>>>>> On Aug 18, 2022, at 15:52, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>>>>
>>>>>>> On 2022/8/18 10:47, Muchun Song wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Aug 18, 2022, at 10:00, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8/18/2022 9:55 AM, Miaohe Lin wrote:
>>>>>>>>>>>>> 	/*
>>>>>>>>>>>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>>>>>>>>>>>> 	 * preceding stores to the page contents become visible before
>>>>>>>>>>>>> 	 * the set_pte_at() write.
>>>>>>>>>>>>> 	 */
>>>>>>>>>>>>> 	__SetPageUptodate(page);
>>>>>>>>>>>> IIUC, in this case we should make sure other CPUs can see the new page’s
>>>>>>>>>>>> contents after they have seen PG_uptodate set. I think commit 0ed361dec369
>>>>>>>>>>>> can tell us more details.
>>>>>>>>>>>>
>>>>>>>>>>>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>>>>>>>>>>>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>>>>>>>>>>>> to ensure the ordering between the page’s contents and subsequent memory
>>>>>>>>>>>> accesses using the corresponding virtual address. Do you agree with this?
>>>>>>>>>>> This is my understanding also. Thanks.
>>>>>>>>>> That's also my understanding. Thanks both.
>>>>>>>>> I have an unclear thing (not directly related to this patch): who is responsible
>>>>>>>>> for the read barrier on the read side in this case?
>>>>>>>>>
>>>>>>>>> For SetPageUptodate, there are pairing write/read memory barriers.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I have the same question. So I think the example proposed by Miaohe is a little
>>>>>>>> different from the case (hugetlb_vmemmap) here.
>>>>>>>
>>>>>>> Per my understanding, the memory barrier in PageUptodate() is needed because a user might access the
>>>>>>> page contents using page_address() (the corresponding pagetable entry already exists) soon. But for
>>>>>>> the above proposed case, if a user wants to access the page contents, the corresponding pagetable entry
>>>>>>> must be visible first or the page contents can't be accessed. So there should be a data dependency
>>>>>>> acting as a memory barrier between loading the pagetable entry and accessing the page contents.
>>>>>>> Or am I missing something?
>>>>>>
>>>>>> Yep, it is a data dependency. The difference between hugetlb_vmemmap and PageUptodate() is that
>>>>>> the page table (a pointer to the mapped page frame) is loaded by the MMU while PageUptodate() is
>>>>>> loaded by the CPU. It seems the data dependency should be inserted between the MMU access and the CPU
>>>>>> access. Maybe it is the hardware’s guarantee?
>>>>> I just found that the comment in pmd_install() explains why most arches have no read
>>>>
>>>> I think pmd_install() is a little different as well. We should make sure
>>>> page table walkers (like GUP) see the correct PTE entry after they see the pmd
>>>> entry.
>>>
>>> The difference I can see is that the pmd/pte case has both a hardware page walker and
>>> a software page walker (like GUP) on the read side, while the case here only has the hardware
>>> page walker on the read side. But I suppose the memory barrier requirement still applies
>>> here.
>>
>> I am not against this change; I just want to get a better understanding of the
>> hardware behavior.
>>
>>>
>>> Maybe we could do a test: add a large delay between reset_struct_pages() and set_pte_at()?
>>
>> Hi Miaohe,
>>
>> Would you mind doing this test? One thread does vmemmap_restore_pte(), and another thread
>> detects whether it can see a tail page with PG_head after the previous thread has executed
>> set_pte_at().
> 
> Would it be easier to construct the memory reordering manually, like below?
> 
> vmemmap_restore_pte()
> 	...
> 	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> 	/* maybe add a delay here. */
> 	copy_page(to, (void *)walk->reuse_addr);
> 	reset_struct_pages(to);
This should be the correct change for the testing. :)

Regards
Yin, Fengwei

> 
> And another thread detects whether it can see a tail page with some invalid fields? If so,
> it seems the problem will always trigger. If not, we are depending on the observed memory reordering,
> and on set_pte_at() not containing a memory barrier?
> 
> Thanks,
> Miaohe Lin
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-18 12:58                             ` Miaohe Lin
  2022-08-18 23:53                               ` Yin, Fengwei
@ 2022-08-19  3:19                               ` Muchun Song
  2022-08-19  7:26                                 ` Miaohe Lin
  1 sibling, 1 reply; 44+ messages in thread
From: Muchun Song @ 2022-08-19  3:19 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Yin, Fengwei, Andrew Morton, Mike Kravetz, Muchun Song, Linux MM,
	linux-kernel



> On Aug 18, 2022, at 20:58, Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
> On 2022/8/18 17:18, Muchun Song wrote:
>> 
>> 
>>> On Aug 18, 2022, at 16:54, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>> 
>>> 
>>> 
>>> On 8/18/2022 4:40 PM, Muchun Song wrote:
>>>> 
>>>> 
>>>>> On Aug 18, 2022, at 16:32, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On 8/18/2022 3:59 PM, Muchun Song wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Aug 18, 2022, at 15:52, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>>>> 
>>>>>>> On 2022/8/18 10:47, Muchun Song wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Aug 18, 2022, at 10:00, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 8/18/2022 9:55 AM, Miaohe Lin wrote:
>>>>>>>>>>>>> 	/*
>>>>>>>>>>>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>>>>>>>>>>>> 	 * preceding stores to the page contents become visible before
>>>>>>>>>>>>> 	 * the set_pte_at() write.
>>>>>>>>>>>>> 	 */
>>>>>>>>>>>>> 	__SetPageUptodate(page);
>>>>>>>>>>>> IIUC, in this case we should make sure other CPUs can see the new page’s
>>>>>>>>>>>> contents after they have seen PG_uptodate set. I think commit 0ed361dec369
>>>>>>>>>>>> can tell us more details.
>>>>>>>>>>>>
>>>>>>>>>>>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>>>>>>>>>>>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>>>>>>>>>>>> to ensure the ordering between the page’s contents and subsequent memory
>>>>>>>>>>>> accesses using the corresponding virtual address. Do you agree with this?
>>>>>>>>>>> This is my understanding also. Thanks.
>>>>>>>>>> That's also my understanding. Thanks both.
>>>>>>>>> I have an unclear thing (not directly related to this patch): who is responsible
>>>>>>>>> for the read barrier on the read side in this case?
>>>>>>>>> 
>>>>>>>>> For SetPageUptodate, there are pairing write/read memory barriers.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> I have the same question. So I think the example proposed by Miaohe is a little
>>>>>>>> different from the case (hugetlb_vmemmap) here.
>>>>>>> 
>>>>>>> Per my understanding, the memory barrier in PageUptodate() is needed because a user might access the
>>>>>>> page contents using page_address() (the corresponding pagetable entry already exists) soon. But for
>>>>>>> the above proposed case, if a user wants to access the page contents, the corresponding pagetable entry
>>>>>>> must be visible first or the page contents can't be accessed. So there should be a data dependency
>>>>>>> acting as a memory barrier between loading the pagetable entry and accessing the page contents.
>>>>>>> Or am I missing something?
>>>>>> 
>>>>>> Yep, it is a data dependency. The difference between hugetlb_vmemmap and PageUptodate() is that
>>>>>> the page table (a pointer to the mapped page frame) is loaded by the MMU while PageUptodate() is
>>>>>> loaded by the CPU. It seems the data dependency should be inserted between the MMU access and the CPU
>>>>>> access. Maybe it is the hardware’s guarantee?
>>>>> I just found that the comment in pmd_install() explains why most arches have no read
>>>> 
>>>> I think pmd_install() is a little different as well. We should make sure
>>>> page table walkers (like GUP) see the correct PTE entry after they see the pmd
>>>> entry.
>>> 
>>> The difference I can see is that the pmd/pte case has both a hardware page walker and
>>> a software page walker (like GUP) on the read side, while the case here only has the hardware
>>> page walker on the read side. But I suppose the memory barrier requirement still applies
>>> here.
>> 
>> I am not against this change; I just want to get a better understanding of the
>> hardware behavior.
>> 
>>> 
>>> Maybe we could do a test: add a large delay between reset_struct_pages() and set_pte_at()?
>> 
>> Hi Miaohe,
>> 
>> Would you mind doing this test? One thread does vmemmap_restore_pte(), and another thread
>> detects whether it can see a tail page with PG_head after the previous thread has executed
>> set_pte_at().
> 
> Would it be easier to construct the memory reordering manually, like below?
> 
> vmemmap_restore_pte()
> 	...
> 	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> 	/* maybe add a delay here. */
> 	copy_page(to, (void *)walk->reuse_addr);
> 	reset_struct_pages(to);


Well, you have changed the code ordering. I thought we wouldn’t change the code
ordering, and would just let the hardware do the reordering. The ideal scenario would be
as follows.


CPU0:						CPU1:

vmemmap_restore_pte()
	copy_page(to, (void *)walk->reuse_addr);
	reset_struct_pages(to); // clear the tail page’s PG_head
	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
						// Detect if it can see a tail page with PG_head.

I should admit it is a little difficult to construct the scenario. After more
thought, I think a barrier should be inserted here. So:

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.

> 
> And another thread detects whether it can see a tail page with some invalid fields? If so,
> it seems the problem will always trigger? If not, we depend on the observed meory reorder
> and set_pte_at doesn't contain a memory barrier?
> 
> Thanks,
> Miaohe Lin


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-19  3:19                               ` Muchun Song
@ 2022-08-19  7:26                                 ` Miaohe Lin
  0 siblings, 0 replies; 44+ messages in thread
From: Miaohe Lin @ 2022-08-19  7:26 UTC (permalink / raw)
  To: Muchun Song, Yin Fengwei
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel

On 2022/8/19 11:19, Muchun Song wrote:
> 
> 
>> On Aug 18, 2022, at 20:58, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2022/8/18 17:18, Muchun Song wrote:
>>>
>>>
>>>> On Aug 18, 2022, at 16:54, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/18/2022 4:40 PM, Muchun Song wrote:
>>>>>
>>>>>
>>>>>> On Aug 18, 2022, at 16:32, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8/18/2022 3:59 PM, Muchun Song wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Aug 18, 2022, at 15:52, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>>>>>
>>>>>>>> On 2022/8/18 10:47, Muchun Song wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Aug 18, 2022, at 10:00, Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/18/2022 9:55 AM, Miaohe Lin wrote:
>>>>>>>>>>>>>> 	/*
>>>>>>>>>>>>>> 	 * The memory barrier inside __SetPageUptodate makes sure that
>>>>>>>>>>>>>> 	 * preceding stores to the page contents become visible before
>>>>>>>>>>>>>> 	 * the set_pte_at() write.
>>>>>>>>>>>>>> 	 */
>>>>>>>>>>>>>> 	__SetPageUptodate(page);
>>>>>>>>>>>>> IIUC, in this case we should make sure other CPUs can see the new page’s
>>>>>>>>>>>>> contents after they have seen PG_uptodate set. I think commit 0ed361dec369
>>>>>>>>>>>>> can tell us more details.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also looked at commit 52f37629fd3c to see why we need a barrier before
>>>>>>>>>>>>> set_pte_at(), but I didn’t find any info explaining why. I guess we want
>>>>>>>>>>>>> to ensure the ordering between the page’s contents and subsequent memory
>>>>>>>>>>>>> accesses using the corresponding virtual address. Do you agree with this?
>>>>>>>>>>>> This is my understanding also. Thanks.
>>>>>>>>>>> That's also my understanding. Thanks both.
>>>>>>>>>> I have an unclear thing (not directly related to this patch): who is responsible
>>>>>>>>>> for the read barrier on the read side in this case?
>>>>>>>>>>
>>>>>>>>>> For SetPageUptodate, there are pairing write/read memory barriers.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have the same question. So I think the example proposed by Miaohe is a little
>>>>>>>>> different from the case (hugetlb_vmemmap) here.
>>>>>>>>
>>>>>>>> Per my understanding, the memory barrier in PageUptodate() is needed because a user might access the
>>>>>>>> page contents using page_address() (the corresponding pagetable entry already exists) soon. But for
>>>>>>>> the above proposed case, if a user wants to access the page contents, the corresponding pagetable entry
>>>>>>>> must be visible first or the page contents can't be accessed. So there should be a data dependency
>>>>>>>> acting as a memory barrier between loading the pagetable entry and accessing the page contents.
>>>>>>>> Or am I missing something?
>>>>>>>
>>>>>>> Yep, it is a data dependency. The difference between hugetlb_vmemmap and PageUptodate() is that
>>>>>>> the page table (a pointer to the mapped page frame) is loaded by the MMU while PageUptodate() is
>>>>>>> loaded by the CPU. It seems the data dependency should be inserted between the MMU access and the CPU
>>>>>>> access. Maybe it is the hardware’s guarantee?
>>>>>> I just found that the comment in pmd_install() explains why most arches have no read
>>>>>
>>>>> I think pmd_install() is a little different as well. We should make sure
>>>>> page table walkers (like GUP) see the correct PTE entry after they see the pmd
>>>>> entry.
>>>>
>>>> The difference I can see is that the pmd/pte case has both a hardware page walker and
>>>> a software page walker (like GUP) on the read side, while the case here only has the hardware
>>>> page walker on the read side. But I suppose the memory barrier requirement still applies
>>>> here.
>>>
>>> I am not against this change; I just want to get a better understanding of the
>>> hardware behavior.
>>>
>>>>
>>>> Maybe we could do a test: add a large delay between reset_struct_pages() and set_pte_at()?
>>>
>>> Hi Miaohe,
>>>
>>> Would you mind doing this test? One thread does vmemmap_restore_pte(), and another thread
>>> detects whether it can see a tail page with PG_head after the previous thread has executed
>>> set_pte_at().
>>
>> Would it be easier to construct the memory reordering manually, like below?
>>
>> vmemmap_restore_pte()
>> 	...
>> 	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>> 	/* maybe add a delay here. */
>> 	copy_page(to, (void *)walk->reuse_addr);
>> 	reset_struct_pages(to);
> 
> 
> Well, you have changed the code ordering. I thought we wouldn’t change the code
> ordering, and would just let the hardware do the reordering. The ideal scenario would be
> as follows.
> 
> 
> CPU0:						CPU1:
> 
> vmemmap_restore_pte()
> 	copy_page(to, (void *)walk->reuse_addr);
> 	reset_struct_pages(to); // clear the tail page’s PG_head
> 	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> 						// Detect if it can see a tail page with PG_head.
> 
> I should admit it is a little difficult to construct the scenario. After more
> thought, I think a barrier should be inserted here. So:
> 
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Many thanks to you both for the review and discussion. :)

Thanks,
Miaohe Lin


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-16 13:05 ` [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at() Miaohe Lin
  2022-08-17  2:53   ` Muchun Song
  2022-08-18  1:15   ` Yin, Fengwei
@ 2022-08-20  8:12   ` Muchun Song
  2022-08-22  8:45     ` Miaohe Lin
  2 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2022-08-20  8:12 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, linux-mm, linux-kernel



> On Aug 16, 2022, at 21:05, Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
> The memory barrier smp_wmb() is needed to make sure that preceding stores
> to the page contents become visible before the below set_pte_at() write.

I found another place where there is a similar case. See kasan_populate_vmalloc_pte() in
mm/kasan/shadow.c.

Should we fix it as well?
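
For context, the pattern there is roughly the following (paraphrased from
mm/kasan/shadow.c, not an exact quote):

	/* kasan_populate_vmalloc_pte(), simplified: */
	page = __get_free_page(GFP_KERNEL);
	memset((void *)page, KASAN_VMALLOC_INVALID, PAGE_SIZE);
	pte = pfn_pte(PFN_DOWN(__pa(page)), PAGE_KERNEL);

	spin_lock(&init_mm.page_table_lock);
	if (likely(pte_none(*ptep))) {
		/* publishes the shadow page without a preceding smp_wmb() */
		set_pte_at(&init_mm, addr, ptep, pte);
		page = 0;
	}
	spin_unlock(&init_mm.page_table_lock);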


> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
> mm/hugetlb_vmemmap.c | 5 +++++
> 1 file changed, 5 insertions(+)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 20f414c0379f..76b2d03a0d8d 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -287,6 +287,11 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> 	copy_page(to, (void *)walk->reuse_addr);
> 	reset_struct_pages(to);
> 
> +	/*
> +	 * Makes sure that preceding stores to the page contents become visible
> +	 * before the set_pte_at() write.
> +	 */
> +	smp_wmb();
> 	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> }
> 
> -- 
> 2.23.0
> 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-20  8:12   ` Muchun Song
@ 2022-08-22  8:45     ` Miaohe Lin
  2022-08-22 10:23       ` Muchun Song
  0 siblings, 1 reply; 44+ messages in thread
From: Miaohe Lin @ 2022-08-22  8:45 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, linux-mm, linux-kernel

On 2022/8/20 16:12, Muchun Song wrote:
> 
> 
>> On Aug 16, 2022, at 21:05, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> The memory barrier smp_wmb() is needed to make sure that preceding stores
>> to the page contents become visible before the below set_pte_at() write.
> 
> I found another place where there is a similar case. See kasan_populate_vmalloc_pte() in
> mm/kasan/shadow.c.

Thanks for your report.

> 
> Should we fix it as well?

I'm not familiar with kasan yet, but I think a memory barrier is needed here, or memory corruption
can't be detected until the contents are visible. Would smp_mb__after_atomic() before set_pte_at()
be enough? What's your opinion?

Thanks,
Miaohe Lin


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-22  8:45     ` Miaohe Lin
@ 2022-08-22 10:23       ` Muchun Song
  2022-08-23  1:42         ` Miaohe Lin
  0 siblings, 1 reply; 44+ messages in thread
From: Muchun Song @ 2022-08-22 10:23 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel



> On Aug 22, 2022, at 16:45, Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
> On 2022/8/20 16:12, Muchun Song wrote:
>> 
>> 
>>> On Aug 16, 2022, at 21:05, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>> 
>>> The memory barrier smp_wmb() is needed to make sure that preceding stores
>>> to the page contents become visible before the below set_pte_at() write.
>> 
>> I found another place where there is a similar case. See kasan_populate_vmalloc_pte() in
>> mm/kasan/shadow.c.
> 
> Thanks for your report.
> 
>> 
>> Should we fix it as well?
> 
> I'm not familiar with kasan yet, but I think a memory barrier is needed here, or memory corruption
> can't be detected until the contents are visible. Would smp_mb__after_atomic() before set_pte_at()
> be enough? What's your opinion?

I didn’t see any atomic operation between set_pte_at() and memset(), so I don’t think
smp_mb__after_atomic() is feasible if we really need to insert a barrier. I suggest
you send an RFC patch to the KASAN maintainers; they are more familiar with this than
we are.

Thanks.

> 
> Thanks,
> Miaohe Lin
> 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at()
  2022-08-22 10:23       ` Muchun Song
@ 2022-08-23  1:42         ` Miaohe Lin
  0 siblings, 0 replies; 44+ messages in thread
From: Miaohe Lin @ 2022-08-23  1:42 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Linux MM, linux-kernel

On 2022/8/22 18:23, Muchun Song wrote:
> 
> 
>> On Aug 22, 2022, at 16:45, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2022/8/20 16:12, Muchun Song wrote:
>>>
>>>
>>>> On Aug 16, 2022, at 21:05, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>
>>>> The memory barrier smp_wmb() is needed to make sure that preceding stores
>>>> to the page contents become visible before the below set_pte_at() write.
>>>
>>> I found another place where there is a similar case. See kasan_populate_vmalloc_pte() in
>>> mm/kasan/shadow.c.
>>
>> Thanks for your report.
>>
>>>
>>> Should we fix it as well?
>>
>> I'm not familiar with kasan yet, but I think a memory barrier is needed here, or memory corruption
>> can't be detected until the contents are visible. Would smp_mb__after_atomic() before set_pte_at()
>> be enough? What's your opinion?
> 
> I didn’t see any atomic operation between set_pte_at() and memset(), so I don’t think
> smp_mb__after_atomic() is feasible if we really need to insert a barrier. I suggest

Oh, it should be smp_mb__after_spinlock(), i.e. something like below:

diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 0e3648b603a6..38e503c89740 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -277,6 +277,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,

        spin_lock(&init_mm.page_table_lock);
        if (likely(pte_none(*ptep))) {
+               smp_mb__after_spinlock();
                set_pte_at(&init_mm, addr, ptep, pte);
                page = 0;
        }

Does this make sense to you?
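
The idea is that spin_lock() is only an ACQUIRE, so it does not order the
earlier memset() stores against the later set_pte_at() store, while
smp_mb__after_spinlock() upgrades the lock acquisition to a full barrier:

	memset((void *)page, KASAN_VMALLOC_INVALID, PAGE_SIZE);

	spin_lock(&init_mm.page_table_lock);	/* ACQUIRE only */
	smp_mb__after_spinlock();		/* full barrier: the memset
						 * is ordered before the PTE */
	set_pte_at(&init_mm, addr, ptep, pte);
	spin_unlock(&init_mm.page_table_lock);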

> you send an RFC patch to the KASAN maintainers; they are more familiar with this than
> we are.

Sounds like a good idea. Will do it.

Thanks,
Miaohe Lin


^ permalink raw reply related	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2022-08-23  1:43 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-16 13:05 [PATCH 0/6] A few fixup patches for hugetlb Miaohe Lin
2022-08-16 13:05 ` [PATCH 1/6] mm/hugetlb: fix incorrect update of max_huge_pages Miaohe Lin
2022-08-16 22:52   ` Mike Kravetz
2022-08-16 23:20     ` Andrew Morton
2022-08-16 23:34       ` Mike Kravetz
2022-08-17  1:53         ` Miaohe Lin
2022-08-17  2:28   ` Muchun Song
2022-08-16 13:05 ` [PATCH 2/6] mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group() Miaohe Lin
2022-08-16 22:55   ` Mike Kravetz
2022-08-17  2:31   ` Muchun Song
2022-08-17  2:39     ` Miaohe Lin
2022-08-16 13:05 ` [PATCH 3/6] mm/hugetlb: fix missing call to restore_reserve_on_error() Miaohe Lin
2022-08-16 23:31   ` Mike Kravetz
2022-08-17  1:59     ` Miaohe Lin
2022-08-16 13:05 ` [PATCH 4/6] mm: hugetlb_vmemmap: add missing smp_wmb() before set_pte_at() Miaohe Lin
2022-08-17  2:53   ` Muchun Song
2022-08-17  8:41     ` Miaohe Lin
2022-08-17  9:13       ` Yin, Fengwei
2022-08-17 11:21       ` Muchun Song
2022-08-18  1:14         ` Yin, Fengwei
2022-08-18  1:55           ` Miaohe Lin
2022-08-18  2:00             ` Yin, Fengwei
2022-08-18  2:47               ` Muchun Song
2022-08-18  7:52                 ` Miaohe Lin
2022-08-18  7:59                   ` Muchun Song
2022-08-18  8:32                     ` Yin, Fengwei
2022-08-18  8:40                       ` Muchun Song
2022-08-18  8:54                         ` Yin, Fengwei
2022-08-18  9:18                           ` Muchun Song
2022-08-18 12:58                             ` Miaohe Lin
2022-08-18 23:53                               ` Yin, Fengwei
2022-08-19  3:19                               ` Muchun Song
2022-08-19  7:26                                 ` Miaohe Lin
2022-08-18  1:15   ` Yin, Fengwei
2022-08-20  8:12   ` Muchun Song
2022-08-22  8:45     ` Miaohe Lin
2022-08-22 10:23       ` Muchun Song
2022-08-23  1:42         ` Miaohe Lin
2022-08-16 13:05 ` [PATCH 5/6] mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node() Miaohe Lin
2022-08-17  9:41   ` Yin, Fengwei
2022-08-18  1:00     ` Yin, Fengwei
2022-08-18  1:12   ` Yin, Fengwei
2022-08-16 13:05 ` [PATCH 6/6] mm/hugetlb: make detecting shared pte more reliable Miaohe Lin
2022-08-17 23:56   ` Mike Kravetz
