linux-kernel.vger.kernel.org archive mirror
* [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page
@ 2022-08-12  8:49 Haiyue Wang
  2022-08-13 23:28 ` Andrew Morton
                   ` (5 more replies)
  0 siblings, 6 replies; 62+ messages in thread
From: Haiyue Wang @ 2022-08-12  8:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, ying.huang, songmuchun, naoya.horiguchi,
	Haiyue Wang

Not all huge page APIs support FOLL_GET option, so the __NR_move_pages
will fail to get the page node information for huge page.

This is a temporary solution to mitigate the racing fix.

After support for following huge pages by FOLL_GET is done, this fix can be
reverted safely.

Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
---
 mm/migrate.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..581dfaad9257 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 	for (i = 0; i < nr_pages; i++) {
 		unsigned long addr = (unsigned long)(*pages);
+		unsigned int foll_flags = FOLL_DUMP;
 		struct vm_area_struct *vma;
 		struct page *page;
 		int err = -EFAULT;
@@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (!vma)
 			goto set_status;
 
+		/* Not all huge page follow APIs support 'FOLL_GET' */
+		if (!is_vm_hugetlb_page(vma))
+			foll_flags |= FOLL_GET;
+
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+		page = follow_page(vma, addr, foll_flags);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 		if (page && !is_zone_device_page(page)) {
 			err = page_to_nid(page);
-			put_page(page);
+			if (foll_flags & FOLL_GET)
+				put_page(page);
 		} else {
 			err = -ENOENT;
 		}
-- 
2.37.1



* Re: [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-12  8:49 [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
@ 2022-08-13 23:28 ` Andrew Morton
  2022-08-14  6:20   ` Wang, Haiyue
  2022-08-14 14:05 ` [PATCH v2 0/3] fix follow_page related issues Haiyue Wang
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2022-08-13 23:28 UTC (permalink / raw)
  To: Haiyue Wang
  Cc: linux-mm, linux-kernel, david, linmiaohe, ying.huang, songmuchun,
	naoya.horiguchi

On Fri, 12 Aug 2022 16:49:21 +0800 Haiyue Wang <haiyue.wang@intel.com> wrote:

> Not all huge page APIs support FOLL_GET option, so the __NR_move_pages
> will fail to get the page node information for huge page.

Which ones need fixing?

What are the user-visible runtime effects of this bug?

Is a -stable backport warranted?

> This is a temporary solution to mitigate the racing fix.
> 
> After support for following huge pages by FOLL_GET is done, this fix can be
> reverted safely.
> 



* RE: [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-13 23:28 ` Andrew Morton
@ 2022-08-14  6:20   ` Wang, Haiyue
  2022-08-14  6:49     ` Wang, Haiyue
  0 siblings, 1 reply; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-14  6:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, david, linmiaohe, Huang, Ying,
	songmuchun, naoya.horiguchi

> -----Original Message-----
> From: Andrew Morton <akpm@linux-foundation.org>
> Sent: Sunday, August 14, 2022 07:29
> To: Wang, Haiyue <haiyue.wang@intel.com>
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; david@redhat.com; linmiaohe@huawei.com; Huang,
> Ying <ying.huang@intel.com>; songmuchun@bytedance.com; naoya.horiguchi@linux.dev
> Subject: Re: [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page
> 
> On Fri, 12 Aug 2022 16:49:21 +0800 Haiyue Wang <haiyue.wang@intel.com> wrote:
> 
> > Not all huge page APIs support FOLL_GET option, so the __NR_move_pages
> > will fail to get the page node information for huge page.
> 
> Which ones need fixing?

1. 'follow_huge_pud' arch/s390/mm/hugetlbpage.c

2. 'follow_huge_addr' arch/ia64/mm/hugetlbpage.c

3. 'follow_huge_pgd' mm/hugetlb.c

And I found that only the 'pud' and 'pmd' paths need to check 'is_vm_hugetlb_page', like:
pud_huge(*pud) && is_vm_hugetlb_page(vma)
pmd_huge(pmdval) && is_vm_hugetlb_page(vma)

So I'm not sure whether my patch can cover 2 & 3 for huge page use cases
other than hugetlbfs.
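
For reference, the generic __weak fallback looked roughly like this around
v5.19 (my paraphrase of follow_huge_pud() in mm/hugetlb.c, not verbatim; the
arch overrides and follow_huge_pgd() bail out the same way), which is why
follow_page() returns NULL for a pud-mapped 1GB huge page once FOLL_GET is set:

struct page * __weak
follow_huge_pud(struct mm_struct *mm, unsigned long address,
		pud_t *pud, int flags)
{
	/* no support yet for taking a reference on a pud-level huge page */
	if (flags & (FOLL_GET | FOLL_PIN))
		return NULL;

	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
}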

> 
> What are the user-visible runtime effects of this bug?
> 

In my test, the '__NR_move_pages' system call will return '-2' for a 1GB huge page
memory map when dumping the page node information. [Tested on linux-5.19 stable]
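
Something like this minimal sketch reproduces it (my approximation of the
test, not the exact program; it assumes a free 1GB hugetlb page and a glibc
environment):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << 26)	/* 30 == log2(1GB), 26 == MAP_HUGE_SHIFT */
#endif

int main(void)
{
	size_t len = 1UL << 30;		/* one 1GB huge page */
	int status = 0;
	void *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
			 -1, 0);

	if (ptr == MAP_FAILED)
		return 1;
	*(char *)ptr = 1;		/* fault the huge page in */

	/* nodes == NULL means query mode: status[0] gets the node id */
	syscall(__NR_move_pages, 0, 1, &ptr, NULL, &status, 0);
	printf("status = %d\n", status);	/* -2 (-ENOENT) on affected kernels */

	munmap(ptr, len);
	return 0;
}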

> Is a -stable backport warranted?

Yes.

Since the mainline has introduced the new patch:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3218f8712d6bb

The backport needs to be rebased like this for 5.19:

-		if (page && !is_zone_device_page(page)) {
+		if (page) {

> 
> > This is a temporary solution to mitigate the racing fix.
> >
> > After support for following huge pages by FOLL_GET is done, this fix can be
> > reverted safely.
> >



* RE: [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-14  6:20   ` Wang, Haiyue
@ 2022-08-14  6:49     ` Wang, Haiyue
  0 siblings, 0 replies; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-14  6:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, david, linmiaohe, Huang, Ying,
	songmuchun, naoya.horiguchi

> -----Original Message-----
> From: Wang, Haiyue
> Sent: Sunday, August 14, 2022 14:20
> To: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; david@redhat.com; linmiaohe@huawei.com; Huang,
> Ying <ying.huang@intel.com>; songmuchun@bytedance.com; naoya.horiguchi@linux.dev
> Subject: RE: [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page
> 
> > -----Original Message-----
> > From: Andrew Morton <akpm@linux-foundation.org>
> > Sent: Sunday, August 14, 2022 07:29
> > To: Wang, Haiyue <haiyue.wang@intel.com>
> > Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; david@redhat.com; linmiaohe@huawei.com; Huang,
> > Ying <ying.huang@intel.com>; songmuchun@bytedance.com; naoya.horiguchi@linux.dev
> > Subject: Re: [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page
> >
> > On Fri, 12 Aug 2022 16:49:21 +0800 Haiyue Wang <haiyue.wang@intel.com> wrote:
> >
> > > Not all huge page APIs support FOLL_GET option, so the __NR_move_pages
> > > will fail to get the page node information for huge page.
> >


> > Is a -stable backport warranted?
> 
> Yes.
> 
> Since the mainline has introduced the new patch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3218f8712d6bb
> 

Looks like the 'is_zone_device_page' check will cause a 'FOLL_GET' page reference
count issue; 'put_page' should be called before the branch jump.

try_grab_page --> 
		if (flags & FOLL_GET)
			folio_ref_inc(folio);

Or have I misunderstood 'is_zone_device_page'? ;-)
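
In other words, a sketch of the leak as I read the do_pages_stat_array() flow
once that commit is applied:

page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);	/* reference taken */

if (page && !is_zone_device_page(page)) {
	err = page_to_nid(page);
	put_page(page);		/* reference dropped only on this path */
} else {
	err = -ENOENT;		/* a device page leaks its reference here */
}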

> The backport needs to be rebased like this for 5.19:
> 
> -		if (page && !is_zone_device_page(page)) {
> +		if (page) {
> 
> >
> > >



* [PATCH v2 0/3] fix follow_page related issues
  2022-08-12  8:49 [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
  2022-08-13 23:28 ` Andrew Morton
@ 2022-08-14 14:05 ` Haiyue Wang
  2022-08-14 14:05   ` [PATCH v2 1/3] mm: revert handling Non-LRU pages returned by follow_page Haiyue Wang
                     ` (2 more replies)
  2022-08-15  1:03 ` [PATCH v3 0/2] fix follow_page related issues Haiyue Wang
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 62+ messages in thread
From: Haiyue Wang @ 2022-08-14 14:05 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Haiyue Wang

v2: Add the Non-LRU pages fix with two patches, so that
    'mm: migration: fix the FOLL_GET' can be applied directly
    on the linux-5.19 stable branch.

Haiyue Wang (3):
  mm: revert handling Non-LRU pages returned by follow_page
  mm: migration: fix the FOLL_GET failure on following huge page
  mm: handling Non-LRU pages returned by follow_page

 mm/huge_memory.c |  4 ++--
 mm/ksm.c         | 16 +++++++++++++---
 mm/migrate.c     | 20 +++++++++++++++-----
 3 files changed, 30 insertions(+), 10 deletions(-)

-- 
2.37.2



* [PATCH v2 1/3] mm: revert handling Non-LRU pages returned by follow_page
  2022-08-14 14:05 ` [PATCH v2 0/3] fix follow_page related issues Haiyue Wang
@ 2022-08-14 14:05   ` Haiyue Wang
  2022-08-14 16:30     ` David Hildenbrand
  2022-08-14 14:05   ` [PATCH v2 2/3] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
  2022-08-14 14:05   ` [PATCH v2 3/3] mm: handling Non-LRU pages returned by follow_page Haiyue Wang
  2 siblings, 1 reply; 62+ messages in thread
From: Haiyue Wang @ 2022-08-14 14:05 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Haiyue Wang

The commit
3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
doesn't handle follow_page with the FOLL_GET flag correctly: it will
do get_page on the page, so it shouldn't just return directly without put_page.

So revert the related fix to prepare for a clean patch to handle Non-LRU
pages returned by follow_page.

Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
---
 mm/huge_memory.c | 2 +-
 mm/ksm.c         | 6 +++---
 mm/migrate.c     | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a7c1b344abe..2ee6d38a1426 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2963,7 +2963,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		/* FOLL_DUMP to ignore special (like zero) pages */
 		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
 
-		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+		if (IS_ERR_OR_NULL(page))
 			continue;
 
 		if (!is_transparent_hugepage(page))
diff --git a/mm/ksm.c b/mm/ksm.c
index 42ab153335a2..fe3e0a39f73a 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -475,7 +475,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 		cond_resched();
 		page = follow_page(vma, addr,
 				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
-		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
 			ret = handle_mm_fault(vma, addr,
@@ -560,7 +560,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 		goto out;
 
 	page = follow_page(vma, addr, FOLL_GET);
-	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+	if (IS_ERR_OR_NULL(page))
 		goto out;
 	if (PageAnon(page)) {
 		flush_anon_page(vma, page, addr);
@@ -2308,7 +2308,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 			if (ksm_test_exit(mm))
 				break;
 			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
-			if (IS_ERR_OR_NULL(*page) || is_zone_device_page(*page)) {
+			if (IS_ERR_OR_NULL(*page)) {
 				ksm_scan.address += PAGE_SIZE;
 				cond_resched();
 				continue;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..3d5f0262ab60 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1672,7 +1672,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 		goto out;
 
 	err = -ENOENT;
-	if (!page || is_zone_device_page(page))
+	if (!page)
 		goto out;
 
 	err = 0;
@@ -1863,7 +1863,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (IS_ERR(page))
 			goto set_status;
 
-		if (page && !is_zone_device_page(page)) {
+		if (page) {
 			err = page_to_nid(page);
 			put_page(page);
 		} else {
-- 
2.37.2



* [PATCH v2 2/3] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-14 14:05 ` [PATCH v2 0/3] fix follow_page related issues Haiyue Wang
  2022-08-14 14:05   ` [PATCH v2 1/3] mm: revert handling Non-LRU pages returned by follow_page Haiyue Wang
@ 2022-08-14 14:05   ` Haiyue Wang
  2022-08-14 14:05   ` [PATCH v2 3/3] mm: handling Non-LRU pages returned by follow_page Haiyue Wang
  2 siblings, 0 replies; 62+ messages in thread
From: Haiyue Wang @ 2022-08-14 14:05 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Haiyue Wang

Not all huge page APIs support FOLL_GET option, so the __NR_move_pages
will fail to get the page node information for huge page.

This is a temporary solution to mitigate the racing fix.

After support for following huge pages by FOLL_GET is done, this fix can be
reverted safely.

Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
---
 mm/migrate.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 3d5f0262ab60..5d304de3950b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 	for (i = 0; i < nr_pages; i++) {
 		unsigned long addr = (unsigned long)(*pages);
+		unsigned int foll_flags = FOLL_DUMP;
 		struct vm_area_struct *vma;
 		struct page *page;
 		int err = -EFAULT;
@@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (!vma)
 			goto set_status;
 
+		/* Not all huge page follow APIs support 'FOLL_GET' */
+		if (!is_vm_hugetlb_page(vma))
+			foll_flags |= FOLL_GET;
+
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+		page = follow_page(vma, addr, foll_flags);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 		if (page) {
 			err = page_to_nid(page);
-			put_page(page);
+			if (foll_flags & FOLL_GET)
+				put_page(page);
 		} else {
 			err = -ENOENT;
 		}
-- 
2.37.2



* [PATCH v2 3/3] mm: handling Non-LRU pages returned by follow_page
  2022-08-14 14:05 ` [PATCH v2 0/3] fix follow_page related issues Haiyue Wang
  2022-08-14 14:05   ` [PATCH v2 1/3] mm: revert handling Non-LRU pages returned by follow_page Haiyue Wang
  2022-08-14 14:05   ` [PATCH v2 2/3] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
@ 2022-08-14 14:05   ` Haiyue Wang
  2022-08-14 16:34     ` David Hildenbrand
  2 siblings, 1 reply; 62+ messages in thread
From: Haiyue Wang @ 2022-08-14 14:05 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Haiyue Wang

Add the missed put_page handling for Non-LRU pages returned by
follow_page with the FOLL_GET flag set.

This is the second patch for fixing the commit
3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")

Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
---
 mm/huge_memory.c |  2 +-
 mm/ksm.c         | 10 ++++++++++
 mm/migrate.c     |  6 +++++-
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ee6d38a1426..b2ba17c3dcd7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2966,7 +2966,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (IS_ERR_OR_NULL(page))
 			continue;
 
-		if (!is_transparent_hugepage(page))
+		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
 			goto next;
 
 		total++;
diff --git a/mm/ksm.c b/mm/ksm.c
index fe3e0a39f73a..1360bb52ada6 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -477,6 +477,10 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
 		if (IS_ERR_OR_NULL(page))
 			break;
+		if (is_zone_device_page(page)) {
+			put_page(page);
+			break;
+		}
 		if (PageKsm(page))
 			ret = handle_mm_fault(vma, addr,
 					      FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE,
@@ -562,10 +566,13 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 	page = follow_page(vma, addr, FOLL_GET);
 	if (IS_ERR_OR_NULL(page))
 		goto out;
+	if (is_zone_device_page(page))
+		goto out_putpage;
 	if (PageAnon(page)) {
 		flush_anon_page(vma, page, addr);
 		flush_dcache_page(page);
 	} else {
+out_putpage:
 		put_page(page);
 out:
 		page = NULL;
@@ -2313,6 +2320,8 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 				cond_resched();
 				continue;
 			}
+			if (is_zone_device_page(*page))
+				goto next_page;
 			if (PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
 				flush_dcache_page(*page);
@@ -2327,6 +2336,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 				mmap_read_unlock(mm);
 				return rmap_item;
 			}
+next_page:
 			put_page(*page);
 			ksm_scan.address += PAGE_SIZE;
 			cond_resched();
diff --git a/mm/migrate.c b/mm/migrate.c
index 5d304de3950b..fee12cd2f294 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1675,6 +1675,9 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 	if (!page)
 		goto out;
 
+	if (is_zone_device_page(page))
+		goto out_putpage;
+
 	err = 0;
 	if (page_to_nid(page) == node)
 		goto out_putpage;
@@ -1869,7 +1872,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 			goto set_status;
 
 		if (page) {
-			err = page_to_nid(page);
+			err = !is_zone_device_page(page) ? page_to_nid(page)
+							 : -ENOENT;
 			if (foll_flags & FOLL_GET)
 				put_page(page);
 		} else {
-- 
2.37.2



* Re: [PATCH v2 1/3] mm: revert handling Non-LRU pages returned by follow_page
  2022-08-14 14:05   ` [PATCH v2 1/3] mm: revert handling Non-LRU pages returned by follow_page Haiyue Wang
@ 2022-08-14 16:30     ` David Hildenbrand
  2022-08-15  1:02       ` Wang, Haiyue
  0 siblings, 1 reply; 62+ messages in thread
From: David Hildenbrand @ 2022-08-14 16:30 UTC (permalink / raw)
  To: Haiyue Wang, linux-mm, linux-kernel
  Cc: akpm, linmiaohe, ying.huang, songmuchun, naoya.horiguchi, alex.sierra

On 14.08.22 16:05, Haiyue Wang wrote:
> The commit
> 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> doesn't handle follow_page with the FOLL_GET flag correctly: it will
> do get_page on the page, so it shouldn't just return directly without put_page.
> 
> So revert the related fix to prepare for a clean patch to handle Non-LRU
> pages returned by follow_page.

What? Why?

Just fix it.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 3/3] mm: handling Non-LRU pages returned by follow_page
  2022-08-14 14:05   ` [PATCH v2 3/3] mm: handling Non-LRU pages returned by follow_page Haiyue Wang
@ 2022-08-14 16:34     ` David Hildenbrand
  2022-08-15  1:03       ` Wang, Haiyue
  0 siblings, 1 reply; 62+ messages in thread
From: David Hildenbrand @ 2022-08-14 16:34 UTC (permalink / raw)
  To: Haiyue Wang, linux-mm, linux-kernel
  Cc: akpm, linmiaohe, ying.huang, songmuchun, naoya.horiguchi, alex.sierra

On 14.08.22 16:05, Haiyue Wang wrote:
> Add the missed put_page handling for Non-LRU pages returned by
> follow_page with the FOLL_GET flag set.
> 
> This is the second patch for fixing the commit
> 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> 
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> ---
>  mm/huge_memory.c |  2 +-
>  mm/ksm.c         | 10 ++++++++++
>  mm/migrate.c     |  6 +++++-
>  3 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2ee6d38a1426..b2ba17c3dcd7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2966,7 +2966,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		if (IS_ERR_OR_NULL(page))
>  			continue;
>  
> -		if (!is_transparent_hugepage(page))
> +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
>  			goto next;
>  
>  		total++;
> diff --git a/mm/ksm.c b/mm/ksm.c
> index fe3e0a39f73a..1360bb52ada6 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -477,6 +477,10 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
>  				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
>  		if (IS_ERR_OR_NULL(page))
>  			break;
> +		if (is_zone_device_page(page)) {
> +			put_page(page);
> +			break;
> +		}

I think we can drop this check completely. While working on patches that
touch this code I realized that this check is completely useless. device
pages are never PageKsm pages and there is no need to special-case here.

If a zone device page could be PageKsm, then we wouldn't handle it here
correctly and would fail to break ksm.

So just drop it.



-- 
Thanks,

David / dhildenb



* RE: [PATCH v2 1/3] mm: revert handling Non-LRU pages returned by follow_page
  2022-08-14 16:30     ` David Hildenbrand
@ 2022-08-15  1:02       ` Wang, Haiyue
  0 siblings, 0 replies; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-15  1:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel
  Cc: akpm, linmiaohe, Huang, Ying, songmuchun, naoya.horiguchi, alex.sierra

> -----Original Message-----
> From: David Hildenbrand <david@redhat.com>
> Sent: Monday, August 15, 2022 00:31
> To: Wang, Haiyue <haiyue.wang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
> Cc: akpm@linux-foundation.org; linmiaohe@huawei.com; Huang, Ying <ying.huang@intel.com>;
> songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com
> Subject: Re: [PATCH v2 1/3] mm: revert handling Non-LRU pages returned by follow_page
> 
> On 14.08.22 16:05, Haiyue Wang wrote:
> > The commit
> > 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> > doesn't handle follow_page with the FOLL_GET flag correctly: it will
> > do get_page on the page, so it shouldn't just return directly without put_page.
> >
> > So revert the related fix to prepare for a clean patch to handle Non-LRU
> > pages returned by follow_page.
> 
> What? Why?
> 

Just as the cover letter said, it's so that PATCH 2/3 can be applied on the
Linux-5.19 branch directly. I will drop this kind of fix, and fix the issue
directly in v3.

> Just fix it.
> 
> --
> Thanks,
> 
> David / dhildenb



* RE: [PATCH v2 3/3] mm: handling Non-LRU pages returned by follow_page
  2022-08-14 16:34     ` David Hildenbrand
@ 2022-08-15  1:03       ` Wang, Haiyue
  0 siblings, 0 replies; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-15  1:03 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel
  Cc: akpm, linmiaohe, Huang, Ying, songmuchun, naoya.horiguchi, alex.sierra

> -----Original Message-----
> From: David Hildenbrand <david@redhat.com>
> Sent: Monday, August 15, 2022 00:34
> To: Wang, Haiyue <haiyue.wang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
> Cc: akpm@linux-foundation.org; linmiaohe@huawei.com; Huang, Ying <ying.huang@intel.com>;
> songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com
> Subject: Re: [PATCH v2 3/3] mm: handling Non-LRU pages returned by follow_page
> 
> On 14.08.22 16:05, Haiyue Wang wrote:
> > Add the missed put_page handling for Non-LRU pages returned by
> > follow_page with the FOLL_GET flag set.
> >
> > This is the second patch for fixing the commit
> > 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> >
> > Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> > ---
> >  mm/huge_memory.c |  2 +-
> >  mm/ksm.c         | 10 ++++++++++
> >  mm/migrate.c     |  6 +++++-
> >  3 files changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index fe3e0a39f73a..1360bb52ada6 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -477,6 +477,10 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
> >  				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
> >  		if (IS_ERR_OR_NULL(page))
> >  			break;
> > +		if (is_zone_device_page(page)) {
> > +			put_page(page);
> > +			break;
> > +		}
> 
> I think we can drop this check completely. While working on patches that
> touch this code I realized that this check is completely useless. device
> pages are never PageKsm pages and there is no need to special-case here.
> 
> If a zone device page could be PageKsm, then we wouldn't handle it here
> correctly and would fail to break ksm.
> 
> So just drop it.
> 

Fixed in v3.

> 
> 
> --
> Thanks,
> 
> David / dhildenb



* [PATCH v3 0/2] fix follow_page related issues
  2022-08-12  8:49 [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
  2022-08-13 23:28 ` Andrew Morton
  2022-08-14 14:05 ` [PATCH v2 0/3] fix follow_page related issues Haiyue Wang
@ 2022-08-15  1:03 ` Haiyue Wang
  2022-08-15  1:03   ` [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
  2022-08-15  1:03   ` [PATCH v3 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
  2022-08-15  1:59 ` [PATCH v4 0/2] fix follow_page related issues Haiyue Wang
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 62+ messages in thread
From: Haiyue Wang @ 2022-08-15  1:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Haiyue Wang

v3: Merge the fix for handling Non-LRU pages into one patch.
    Drop the break_ksm zone device page check.

v2: Add the Non-LRU pages fix with two patches, so that
    'mm: migration: fix the FOLL_GET' can be applied directly
    on the linux-5.19 stable branch.

Haiyue Wang (2):
  mm: migration: fix the FOLL_GET failure on following huge page
  mm: fix the handling Non-LRU pages returned by follow_page

 mm/huge_memory.c |  4 ++--
 mm/ksm.c         | 12 +++++++++---
 mm/migrate.c     | 20 +++++++++++++++-----
 3 files changed, 26 insertions(+), 10 deletions(-)

-- 
2.37.2



* [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  1:03 ` [PATCH v3 0/2] fix follow_page related issues Haiyue Wang
@ 2022-08-15  1:03   ` Haiyue Wang
  2022-08-15  1:59     ` Huang, Ying
  2022-08-15  1:03   ` [PATCH v3 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
  1 sibling, 1 reply; 62+ messages in thread
From: Haiyue Wang @ 2022-08-15  1:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Haiyue Wang

Not all huge page APIs support FOLL_GET option, so the __NR_move_pages
will fail to get the page node information for huge page.

This is a temporary solution to mitigate the racing fix.

After support for following huge pages by FOLL_GET is done, this fix can be
reverted safely.

Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
---
 mm/migrate.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..581dfaad9257 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 	for (i = 0; i < nr_pages; i++) {
 		unsigned long addr = (unsigned long)(*pages);
+		unsigned int foll_flags = FOLL_DUMP;
 		struct vm_area_struct *vma;
 		struct page *page;
 		int err = -EFAULT;
@@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (!vma)
 			goto set_status;
 
+		/* Not all huge page follow APIs support 'FOLL_GET' */
+		if (!is_vm_hugetlb_page(vma))
+			foll_flags |= FOLL_GET;
+
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+		page = follow_page(vma, addr, foll_flags);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 		if (page && !is_zone_device_page(page)) {
 			err = page_to_nid(page);
-			put_page(page);
+			if (foll_flags & FOLL_GET)
+				put_page(page);
 		} else {
 			err = -ENOENT;
 		}
-- 
2.37.2



* [PATCH v3 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-15  1:03 ` [PATCH v3 0/2] fix follow_page related issues Haiyue Wang
  2022-08-15  1:03   ` [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
@ 2022-08-15  1:03   ` Haiyue Wang
  2022-08-15  1:39     ` Huang, Ying
  1 sibling, 1 reply; 62+ messages in thread
From: Haiyue Wang @ 2022-08-15  1:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Haiyue Wang, Felix Kuehling, Alistair Popple

The page returned by follow_page with 'FOLL_GET' has get_page called, so
it needs to call put_page for handling the reference count correctly.

And as David reviewed, "device pages are never PageKsm pages". Drop this
zone device page check for break_ksm.

Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
---
 mm/huge_memory.c |  4 ++--
 mm/ksm.c         | 12 +++++++++---
 mm/migrate.c     | 10 +++++++---
 3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a7c1b344abe..b2ba17c3dcd7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		/* FOLL_DUMP to ignore special (like zero) pages */
 		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
 
-		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+		if (IS_ERR_OR_NULL(page))
 			continue;
 
-		if (!is_transparent_hugepage(page))
+		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
 			goto next;
 
 		total++;
diff --git a/mm/ksm.c b/mm/ksm.c
index 42ab153335a2..e26f57fc1f0e 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -475,7 +475,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 		cond_resched();
 		page = follow_page(vma, addr,
 				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
-		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
 			ret = handle_mm_fault(vma, addr,
@@ -560,12 +560,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 		goto out;
 
 	page = follow_page(vma, addr, FOLL_GET);
-	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+	if (IS_ERR_OR_NULL(page))
 		goto out;
+	if (is_zone_device_page(page))
+		goto out_putpage;
 	if (PageAnon(page)) {
 		flush_anon_page(vma, page, addr);
 		flush_dcache_page(page);
 	} else {
+out_putpage:
 		put_page(page);
 out:
 		page = NULL;
@@ -2308,11 +2311,13 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 			if (ksm_test_exit(mm))
 				break;
 			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
-			if (IS_ERR_OR_NULL(*page) || is_zone_device_page(*page)) {
+			if (IS_ERR_OR_NULL(*page)) {
 				ksm_scan.address += PAGE_SIZE;
 				cond_resched();
 				continue;
 			}
+			if (is_zone_device_page(*page))
+				goto next_page;
 			if (PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
 				flush_dcache_page(*page);
@@ -2327,6 +2332,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 				mmap_read_unlock(mm);
 				return rmap_item;
 			}
+next_page:
 			put_page(*page);
 			ksm_scan.address += PAGE_SIZE;
 			cond_resched();
diff --git a/mm/migrate.c b/mm/migrate.c
index 581dfaad9257..fee12cd2f294 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1672,9 +1672,12 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 		goto out;
 
 	err = -ENOENT;
-	if (!page || is_zone_device_page(page))
+	if (!page)
 		goto out;
 
+	if (is_zone_device_page(page))
+		goto out_putpage;
+
 	err = 0;
 	if (page_to_nid(page) == node)
 		goto out_putpage;
@@ -1868,8 +1871,9 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (IS_ERR(page))
 			goto set_status;
 
-		if (page && !is_zone_device_page(page)) {
-			err = page_to_nid(page);
+		if (page) {
+			err = !is_zone_device_page(page) ? page_to_nid(page)
+							 : -ENOENT;
 			if (foll_flags & FOLL_GET)
 				put_page(page);
 		} else {
-- 
2.37.2



* Re: [PATCH v3 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-15  1:03   ` [PATCH v3 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
@ 2022-08-15  1:39     ` Huang, Ying
  2022-08-15  1:46       ` Wang, Haiyue
  0 siblings, 1 reply; 62+ messages in thread
From: Huang, Ying @ 2022-08-15  1:39 UTC (permalink / raw)
  To: Haiyue Wang
  Cc: linux-mm, linux-kernel, akpm, david, linmiaohe, songmuchun,
	naoya.horiguchi, alex.sierra, Felix Kuehling, Alistair Popple

Haiyue Wang <haiyue.wang@intel.com> writes:

> The page returned by follow_page with 'FOLL_GET' has get_page called, so

Better to add "()" after function name to improve the readability, for
example, "follow_page()".

> it needs to call put_page for handling the reference count correctly.
>
> And as David reviewed, "device pages are never PageKsm pages". Drop this
> zone device page check for break_ksm.
>
> Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")

In your patch description, I cannot find how the above commit is related
to the bug you fixed.  Add some words about that?

Best Regards,
Huang, Ying

> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> ---
>  mm/huge_memory.c |  4 ++--
>  mm/ksm.c         | 12 +++++++++---
>  mm/migrate.c     | 10 +++++++---
>  3 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8a7c1b344abe..b2ba17c3dcd7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		/* FOLL_DUMP to ignore special (like zero) pages */
>  		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
>  
> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +		if (IS_ERR_OR_NULL(page))
>  			continue;
>  
> -		if (!is_transparent_hugepage(page))
> +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
>  			goto next;
>  
>  		total++;
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 42ab153335a2..e26f57fc1f0e 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -475,7 +475,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
>  		cond_resched();
>  		page = follow_page(vma, addr,
>  				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +		if (IS_ERR_OR_NULL(page))
>  			break;
>  		if (PageKsm(page))
>  			ret = handle_mm_fault(vma, addr,
> @@ -560,12 +560,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
>  		goto out;
>  
>  	page = follow_page(vma, addr, FOLL_GET);
> -	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +	if (IS_ERR_OR_NULL(page))
>  		goto out;
> +	if (is_zone_device_page(page))
> +		goto out_putpage;
>  	if (PageAnon(page)) {
>  		flush_anon_page(vma, page, addr);
>  		flush_dcache_page(page);
>  	} else {
> +out_putpage:
>  		put_page(page);
>  out:
>  		page = NULL;
> @@ -2308,11 +2311,13 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  			if (ksm_test_exit(mm))
>  				break;
>  			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
> -			if (IS_ERR_OR_NULL(*page) || is_zone_device_page(*page)) {
> +			if (IS_ERR_OR_NULL(*page)) {
>  				ksm_scan.address += PAGE_SIZE;
>  				cond_resched();
>  				continue;
>  			}
> +			if (is_zone_device_page(*page))
> +				goto next_page;
>  			if (PageAnon(*page)) {
>  				flush_anon_page(vma, *page, ksm_scan.address);
>  				flush_dcache_page(*page);
> @@ -2327,6 +2332,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  				mmap_read_unlock(mm);
>  				return rmap_item;
>  			}
> +next_page:
>  			put_page(*page);
>  			ksm_scan.address += PAGE_SIZE;
>  			cond_resched();
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 581dfaad9257..fee12cd2f294 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1672,9 +1672,12 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>  		goto out;
>  
>  	err = -ENOENT;
> -	if (!page || is_zone_device_page(page))
> +	if (!page)
>  		goto out;
>  
> +	if (is_zone_device_page(page))
> +		goto out_putpage;
> +
>  	err = 0;
>  	if (page_to_nid(page) == node)
>  		goto out_putpage;
> @@ -1868,8 +1871,9 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  		if (IS_ERR(page))
>  			goto set_status;
>  
> -		if (page && !is_zone_device_page(page)) {
> -			err = page_to_nid(page);
> +		if (page) {
> +			err = !is_zone_device_page(page) ? page_to_nid(page)
> +							 : -ENOENT;
>  			if (foll_flags & FOLL_GET)
>  				put_page(page);
>  		} else {


* RE: [PATCH v3 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-15  1:39     ` Huang, Ying
@ 2022-08-15  1:46       ` Wang, Haiyue
  0 siblings, 0 replies; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-15  1:46 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, akpm, david, linmiaohe, songmuchun,
	naoya.horiguchi, alex.sierra, Felix Kuehling, Alistair Popple

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Monday, August 15, 2022 09:40
> To: Wang, Haiyue <haiyue.wang@intel.com>
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; david@redhat.com;
> linmiaohe@huawei.com; songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com; Felix
> Kuehling <Felix.Kuehling@amd.com>; Alistair Popple <apopple@nvidia.com>
> Subject: Re: [PATCH v3 2/2] mm: fix the handling Non-LRU pages returned by follow_page
> 
> Haiyue Wang <haiyue.wang@intel.com> writes:
> 
> > The page returned by follow_page with 'FOLL_GET' has get_page called, so
> 
> Better to add "()" after function name to improve the readability, for
> example, "follow_page()".

Got it, didn't notice this. ;-)

> 
> > it needs to call put_page for handling the reference count correctly.
> >
> > And as David reviewed, "device pages are never PageKsm pages". Drop this
> > zone device page check for break_ksm.
> >
> > Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> 
> In your patch description, I cannot find how the above commit is related
> to the bug you fixed.  Add some words about that?

The first commit message paragraph does, but maybe not very clearly; I will
try to add more words in v4.

> 
> Best Regards,
> Huang, Ying
> 
> > Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> > ---
> >  mm/huge_memory.c |  4 ++--
> >  mm/ksm.c         | 12 +++++++++---
> >  mm/migrate.c     | 10 +++++++---
> >  3 files changed, 18 insertions(+), 8 deletions(-)
> >



* [PATCH v4 0/2] fix follow_page related issues
  2022-08-12  8:49 [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
                   ` (2 preceding siblings ...)
  2022-08-15  1:03 ` [PATCH v3 0/2] fix follow_page related issues Haiyue Wang
@ 2022-08-15  1:59 ` Haiyue Wang
  2022-08-15  1:59   ` [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
  2022-08-15  1:59   ` [PATCH v4 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
  2022-08-15  7:02 ` [PATCH v5 0/2] fix follow_page related issues Haiyue Wang
  2022-08-16  2:20 ` [PATCH v6 0/2] fix follow_page related issues Haiyue Wang
  5 siblings, 2 replies; 62+ messages in thread
From: Haiyue Wang @ 2022-08-15  1:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Haiyue Wang

v4: add '()' after function names for readability.
    add more words about the Non-LRU pages fix in the commit message.

v3: Merge the fix for handling Non-LRU pages into one patch.
    Drop the break_ksm zone device page check.

v2: Add the Non-LRU pages fix with two patches, so that
    'mm: migration: fix the FOLL_GET' can be applied directly
    on the linux-5.19 stable branch.

Haiyue Wang (2):
  mm: migration: fix the FOLL_GET failure on following huge page
  mm: fix the handling Non-LRU pages returned by follow_page

 mm/huge_memory.c |  4 ++--
 mm/ksm.c         | 12 +++++++++---
 mm/migrate.c     | 20 +++++++++++++++-----
 3 files changed, 26 insertions(+), 10 deletions(-)

-- 
2.37.2



* [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  1:59 ` [PATCH v4 0/2] fix follow_page related issues Haiyue Wang
@ 2022-08-15  1:59   ` Haiyue Wang
  2022-08-15  4:28     ` Alistair Popple
  2022-08-15  1:59   ` [PATCH v4 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
  1 sibling, 1 reply; 62+ messages in thread
From: Haiyue Wang @ 2022-08-15  1:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Haiyue Wang

Not all huge page APIs support FOLL_GET option, so the __NR_move_pages
will fail to get the page node information for huge page.

This is a temporary solution to mitigate the racing fix.

After support for following huge pages by FOLL_GET is done, this fix can be
reverted safely.

Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
---
 mm/migrate.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..581dfaad9257 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 	for (i = 0; i < nr_pages; i++) {
 		unsigned long addr = (unsigned long)(*pages);
+		unsigned int foll_flags = FOLL_DUMP;
 		struct vm_area_struct *vma;
 		struct page *page;
 		int err = -EFAULT;
@@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (!vma)
 			goto set_status;
 
+		/* Not all huge page follow APIs support 'FOLL_GET' */
+		if (!is_vm_hugetlb_page(vma))
+			foll_flags |= FOLL_GET;
+
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+		page = follow_page(vma, addr, foll_flags);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 		if (page && !is_zone_device_page(page)) {
 			err = page_to_nid(page);
-			put_page(page);
+			if (foll_flags & FOLL_GET)
+				put_page(page);
 		} else {
 			err = -ENOENT;
 		}
-- 
2.37.2



* [PATCH v4 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-15  1:59 ` [PATCH v4 0/2] fix follow_page related issues Haiyue Wang
  2022-08-15  1:59   ` [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
@ 2022-08-15  1:59   ` Haiyue Wang
  1 sibling, 0 replies; 62+ messages in thread
From: Haiyue Wang @ 2022-08-15  1:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Haiyue Wang, Alistair Popple, Felix Kuehling

The handling of Non-LRU pages returned by follow_page() jumps out directly
without calling put_page() to handle the reference count, even though the
'FOLL_GET' flag for follow_page() has get_page() called. Fix the zone device
page check by handling the page reference count correctly before returning.

And as David reviewed, "device pages are never PageKsm pages". Drop this
zone device page check for break_ksm().

Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
---
 mm/huge_memory.c |  4 ++--
 mm/ksm.c         | 12 +++++++++---
 mm/migrate.c     | 10 +++++++---
 3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a7c1b344abe..b2ba17c3dcd7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		/* FOLL_DUMP to ignore special (like zero) pages */
 		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
 
-		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+		if (IS_ERR_OR_NULL(page))
 			continue;
 
-		if (!is_transparent_hugepage(page))
+		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
 			goto next;
 
 		total++;
diff --git a/mm/ksm.c b/mm/ksm.c
index 42ab153335a2..e26f57fc1f0e 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -475,7 +475,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 		cond_resched();
 		page = follow_page(vma, addr,
 				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
-		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
 			ret = handle_mm_fault(vma, addr,
@@ -560,12 +560,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 		goto out;
 
 	page = follow_page(vma, addr, FOLL_GET);
-	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+	if (IS_ERR_OR_NULL(page))
 		goto out;
+	if (is_zone_device_page(page))
+		goto out_putpage;
 	if (PageAnon(page)) {
 		flush_anon_page(vma, page, addr);
 		flush_dcache_page(page);
 	} else {
+out_putpage:
 		put_page(page);
 out:
 		page = NULL;
@@ -2308,11 +2311,13 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 			if (ksm_test_exit(mm))
 				break;
 			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
-			if (IS_ERR_OR_NULL(*page) || is_zone_device_page(*page)) {
+			if (IS_ERR_OR_NULL(*page)) {
 				ksm_scan.address += PAGE_SIZE;
 				cond_resched();
 				continue;
 			}
+			if (is_zone_device_page(*page))
+				goto next_page;
 			if (PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
 				flush_dcache_page(*page);
@@ -2327,6 +2332,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 				mmap_read_unlock(mm);
 				return rmap_item;
 			}
+next_page:
 			put_page(*page);
 			ksm_scan.address += PAGE_SIZE;
 			cond_resched();
diff --git a/mm/migrate.c b/mm/migrate.c
index 581dfaad9257..fee12cd2f294 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1672,9 +1672,12 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 		goto out;
 
 	err = -ENOENT;
-	if (!page || is_zone_device_page(page))
+	if (!page)
 		goto out;
 
+	if (is_zone_device_page(page))
+		goto out_putpage;
+
 	err = 0;
 	if (page_to_nid(page) == node)
 		goto out_putpage;
@@ -1868,8 +1871,9 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (IS_ERR(page))
 			goto set_status;
 
-		if (page && !is_zone_device_page(page)) {
-			err = page_to_nid(page);
+		if (page) {
+			err = !is_zone_device_page(page) ? page_to_nid(page)
+							 : -ENOENT;
 			if (foll_flags & FOLL_GET)
 				put_page(page);
 		} else {
-- 
2.37.2



* Re: [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  1:03   ` [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
@ 2022-08-15  1:59     ` Huang, Ying
  2022-08-15  2:10       ` Wang, Haiyue
  0 siblings, 1 reply; 62+ messages in thread
From: Huang, Ying @ 2022-08-15  1:59 UTC (permalink / raw)
  To: Haiyue Wang
  Cc: linux-mm, linux-kernel, akpm, david, linmiaohe, songmuchun,
	naoya.horiguchi, alex.sierra

Haiyue Wang <haiyue.wang@intel.com> writes:

> Not all huge page APIs support FOLL_GET option, so the __NR_move_pages

move_pages() is a syscall, so you can just call it move_pages(), or
move_pages() syscall.

> will fail to get the page node information for huge page.
                                                 ~~~~~~~~~
some huge pages?

> This is a temporary solution to mitigate the racing fix.

Why is it "racing fix"?  This isn't a race condition fix.

Best Regards,
Huang, Ying

> After support for following huge pages by FOLL_GET is done, this fix can be
> reverted safely.
>
> Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>

[snip]


* RE: [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  1:59     ` Huang, Ying
@ 2022-08-15  2:10       ` Wang, Haiyue
  2022-08-15  2:15         ` Wang, Haiyue
  0 siblings, 1 reply; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-15  2:10 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, akpm, david, linmiaohe, songmuchun,
	naoya.horiguchi, alex.sierra

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Monday, August 15, 2022 09:59
> To: Wang, Haiyue <haiyue.wang@intel.com>
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; david@redhat.com;
> linmiaohe@huawei.com; songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com
> Subject: Re: [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page
> 
> Haiyue Wang <haiyue.wang@intel.com> writes:
> 
> > Not all huge page APIs support FOLL_GET option, so the __NR_move_pages
> 
> move_pages() is a syscall, so you can just call it move_pages(), or
> move_pages() syscall.

The application that hit the issue uses the below function:
syscall(__NR_move_pages, 0, n_pages, ptr, 0, status, 0)

So I used it directly in the commit message. "move_pages() syscall" is better.
Will update later.

> 
> > will fail to get the page node information for huge page.
>                                                  ~~~~~~~~~
> some huge pages?

OK.

> 
> > This is a temporary solution to mitigate the racing fix.
> 
> Why is it "racing fix"?  This isn't a race condition fix.

The 'Fixes' commit is about a race condition fix.

How about "This is a temporary solution to mitigate the side effect
of the race condition fix"

> 
> Best Regards,
> Huang, Ying
> 
> > After support for following huge pages by FOLL_GET is done, this fix can be
> > reverted safely.
> >
> > Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
> > Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> 
> [snip]


* RE: [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  2:10       ` Wang, Haiyue
@ 2022-08-15  2:15         ` Wang, Haiyue
  2022-08-15  2:51           ` Huang, Ying
  0 siblings, 1 reply; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-15  2:15 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, akpm, david, linmiaohe, songmuchun,
	naoya.horiguchi, alex.sierra

> -----Original Message-----
> From: Wang, Haiyue
> Sent: Monday, August 15, 2022 10:11
> To: Huang, Ying <ying.huang@intel.com>
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; david@redhat.com;
> linmiaohe@huawei.com; songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com
> Subject: RE: [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page
> 
> > -----Original Message-----
> > From: Huang, Ying <ying.huang@intel.com>
> > Sent: Monday, August 15, 2022 09:59
> > To: Wang, Haiyue <haiyue.wang@intel.com>
> > Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; david@redhat.com;
> > linmiaohe@huawei.com; songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com
> > Subject: Re: [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page
> >
> > Haiyue Wang <haiyue.wang@intel.com> writes:
> >


> >
> > > This is a temporary solution to mitigate the racing fix.
> >
> > Why is it "racing fix"?  This isn't a race condition fix.
> 
> The 'Fixes' commit is about a race condition fix.
> 
> How about "This is a temporary solution to mitigate the side effect
> of the race condition fix"

Try to add more words to make things clearer:

"This is a temporary solution to mitigate the side effect of the race
condition fix by calling follow_page() with FOLL_GET set."

> 
> >
> > Best Regards,
> > Huang, Ying
> >
> > > After support for following huge pages by FOLL_GET is done, this fix can be
> > > reverted safely.
> > >
> > > Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
> > > Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> >
> > [snip]


* Re: [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  2:15         ` Wang, Haiyue
@ 2022-08-15  2:51           ` Huang, Ying
  0 siblings, 0 replies; 62+ messages in thread
From: Huang, Ying @ 2022-08-15  2:51 UTC (permalink / raw)
  To: Wang, Haiyue
  Cc: linux-mm, linux-kernel, akpm, david, linmiaohe, songmuchun,
	naoya.horiguchi, alex.sierra

"Wang, Haiyue" <haiyue.wang@intel.com> writes:

>> -----Original Message-----
>> From: Wang, Haiyue
>> Sent: Monday, August 15, 2022 10:11
>> To: Huang, Ying <ying.huang@intel.com>
>> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; david@redhat.com;
>> linmiaohe@huawei.com; songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com
>> Subject: RE: [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page
>>
>> > -----Original Message-----
>> > From: Huang, Ying <ying.huang@intel.com>
>> > Sent: Monday, August 15, 2022 09:59
>> > To: Wang, Haiyue <haiyue.wang@intel.com>
>> > Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; david@redhat.com;
>> > linmiaohe@huawei.com; songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com
>> > Subject: Re: [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page
>> >
>> > Haiyue Wang <haiyue.wang@intel.com> writes:
>> >
>
>
>> >
>> > > This is a temporary solution to mitigate the racing fix.
>> >
>> > Why is it "racing fix"?  This isn't a race condition fix.
>>
>> The 'Fixes' commit is about a race condition fix.
>>
>> How about "This is a temporary solution to mitigate the side effect
>> of the race condition fix"
>
> Try to add more words to make things clearer:
>
> "This is a temporary solution to mitigate the side effect of the race
> condition fix by calling follow_page() with FOLL_GET set."

Looks good to me.  Thanks!

Best Regards,
Huang, Ying

>>
>> >
>> > Best Regards,
>> > Huang, Ying
>> >
>> > > After support for following huge pages by FOLL_GET is done, this fix can be
>> > > reverted safely.
>> > >
>> > > Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
>> > > Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
>> >
>> > [snip]


* Re: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  1:59   ` [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
@ 2022-08-15  4:28     ` Alistair Popple
  2022-08-15  4:40       ` Wang, Haiyue
  0 siblings, 1 reply; 62+ messages in thread
From: Alistair Popple @ 2022-08-15  4:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Haiyue Wang
  Cc: akpm, david, linmiaohe, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Haiyue Wang

On Monday, 15 August 2022 11:59:08 AM AEST Haiyue Wang wrote:
> Not all huge page APIs support FOLL_GET option, so the __NR_move_pages
> will fail to get the page node information for huge page.

I think you should be explicit in the commit message about which functions do 
not support FOLL_GET as it's not obvious what support needs to be added before 
this fix can be reverted.

Thanks.

 - Alistair

> This is an temporary solution to mitigate the racing fix.
> 
> After supporting follow huge page by FOLL_GET is done, this fix can be
> reverted safely.
> 
> Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing 
with memory offline")
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> ---
>  mm/migrate.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6a1597c92261..581dfaad9257 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm, 
unsigned long nr_pages,
>  
>  	for (i = 0; i < nr_pages; i++) {
>  		unsigned long addr = (unsigned long)(*pages);
> +		unsigned int foll_flags = FOLL_DUMP;
>  		struct vm_area_struct *vma;
>  		struct page *page;
>  		int err = -EFAULT;
> @@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm, 
unsigned long nr_pages,
>  		if (!vma)
>  			goto set_status;
>  
> +		/* Not all huge page follow APIs support 'FOLL_GET' */
> +		if (!is_vm_hugetlb_page(vma))
> +			foll_flags |= FOLL_GET;
> +
>  		/* FOLL_DUMP to ignore special (like zero) pages */
> -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> +		page = follow_page(vma, addr, foll_flags);
>  
>  		err = PTR_ERR(page);
>  		if (IS_ERR(page))
> @@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm, 
unsigned long nr_pages,
>  
>  		if (page && !is_zone_device_page(page)) {
>  			err = page_to_nid(page);
> -			put_page(page);
> +			if (foll_flags & FOLL_GET)
> +				put_page(page);
>  		} else {
>  			err = -ENOENT;
>  		}
> 





^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  4:28     ` Alistair Popple
@ 2022-08-15  4:40       ` Wang, Haiyue
  2022-08-15  5:16         ` Alistair Popple
  0 siblings, 1 reply; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-15  4:40 UTC (permalink / raw)
  To: Alistair Popple, linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, Huang, Ying, songmuchun, naoya.horiguchi,
	alex.sierra

> -----Original Message-----
> From: Alistair Popple <apopple@nvidia.com>
> Sent: Monday, August 15, 2022 12:29
> To: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Wang, Haiyue <haiyue.wang@intel.com>
> Cc: akpm@linux-foundation.org; david@redhat.com; linmiaohe@huawei.com; Huang, Ying
> <ying.huang@intel.com>; songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com; Wang,
> Haiyue <haiyue.wang@intel.com>
> Subject: Re: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page
> 
> On Monday, 15 August 2022 11:59:08 AM AEST Haiyue Wang wrote:
> > Not all huge page APIs support FOLL_GET option, so the __NR_move_pages
> > will fail to get the page node information for huge page.
> 
> I think you should be explicit in the commit message about which functions do
> not support FOLL_GET as it's not obvious what support needs to be added before
> this fix can be reverted.

Yes, makes sense, will add them in a new patch.

> 
> Thanks.
> 
>  - Alistair
> 
> > This is an temporary solution to mitigate the racing fix.
> >
> > After supporting follow huge page by FOLL_GET is done, this fix can be
> > reverted safely.
> >
> > Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing
> with memory offline")
> > Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> > ---
> >  mm/migrate.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 6a1597c92261..581dfaad9257 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm,
> unsigned long nr_pages,
> >
> >  	for (i = 0; i < nr_pages; i++) {
> >  		unsigned long addr = (unsigned long)(*pages);
> > +		unsigned int foll_flags = FOLL_DUMP;
> >  		struct vm_area_struct *vma;
> >  		struct page *page;
> >  		int err = -EFAULT;
> > @@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm,
> unsigned long nr_pages,
> >  		if (!vma)
> >  			goto set_status;
> >
> > +		/* Not all huge page follow APIs support 'FOLL_GET' */
> > +		if (!is_vm_hugetlb_page(vma))
> > +			foll_flags |= FOLL_GET;
> > +
> >  		/* FOLL_DUMP to ignore special (like zero) pages */
> > -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> > +		page = follow_page(vma, addr, foll_flags);
> >
> >  		err = PTR_ERR(page);
> >  		if (IS_ERR(page))
> > @@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm,
> unsigned long nr_pages,
> >
> >  		if (page && !is_zone_device_page(page)) {
> >  			err = page_to_nid(page);
> > -			put_page(page);
> > +			if (foll_flags & FOLL_GET)
> > +				put_page(page);
> >  		} else {
> >  			err = -ENOENT;
> >  		}
> >
> 
> 
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  4:40       ` Wang, Haiyue
@ 2022-08-15  5:16         ` Alistair Popple
  2022-08-15  5:20           ` Wang, Haiyue
  0 siblings, 1 reply; 62+ messages in thread
From: Alistair Popple @ 2022-08-15  5:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Wang, Haiyue
  Cc: akpm, david, linmiaohe, Huang, Ying, songmuchun, naoya.horiguchi,
	alex.sierra

On Monday, 15 August 2022 2:40:48 PM AEST Wang, Haiyue wrote:
> > -----Original Message-----
> > From: Alistair Popple <apopple@nvidia.com>
> > Sent: Monday, August 15, 2022 12:29
> > To: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Wang, Haiyue 
<haiyue.wang@intel.com>
> > Cc: akpm@linux-foundation.org; david@redhat.com; linmiaohe@huawei.com; 
Huang, Ying
> > <ying.huang@intel.com>; songmuchun@bytedance.com; 
naoya.horiguchi@linux.dev; alex.sierra@amd.com; Wang,
> > Haiyue <haiyue.wang@intel.com>
> > Subject: Re: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on 
following huge page
> > 
> > On Monday, 15 August 2022 11:59:08 AM AEST Haiyue Wang wrote:
> > > Not all huge page APIs support FOLL_GET option, so the __NR_move_pages
> > > will fail to get the page node information for huge page.
> > 
> > I think you should be explicit in the commit message about which functions 
do
> > not support FOLL_GET as it's not obvious what support needs to be added 
before
> > this fix can be reverted.
> 
> Yes, makes sense, will add them in a new patch.

Actually while you're at it I think it would be good to include a description 
of the impact of this failure in the commit message. Ie. Your answer to:

> What are the user-visible runtime effects of this bug?

As it documents what should be tested if this fix does actually ever get 
reverted.

> > 
> > Thanks.
> > 
> >  - Alistair
> > 
> > > This is an temporary solution to mitigate the racing fix.
> > >
> > > After supporting follow huge page by FOLL_GET is done, this fix can be
> > > reverted safely.
> > >
> > > Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array 
racing
> > with memory offline")
> > > Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> > > ---
> > >  mm/migrate.c | 10 ++++++++--
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 6a1597c92261..581dfaad9257 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct 
*mm,
> > unsigned long nr_pages,
> > >
> > >  	for (i = 0; i < nr_pages; i++) {
> > >  		unsigned long addr = (unsigned long)(*pages);
> > > +		unsigned int foll_flags = FOLL_DUMP;
> > >  		struct vm_area_struct *vma;
> > >  		struct page *page;
> > >  		int err = -EFAULT;
> > > @@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct 
*mm,
> > unsigned long nr_pages,
> > >  		if (!vma)
> > >  			goto set_status;
> > >
> > > +		/* Not all huge page follow APIs support 'FOLL_GET' */
> > > +		if (!is_vm_hugetlb_page(vma))
> > > +			foll_flags |= FOLL_GET;
> > > +
> > >  		/* FOLL_DUMP to ignore special (like zero) pages */
> > > -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> > > +		page = follow_page(vma, addr, foll_flags);
> > >
> > >  		err = PTR_ERR(page);
> > >  		if (IS_ERR(page))
> > > @@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct 
*mm,
> > unsigned long nr_pages,
> > >
> > >  		if (page && !is_zone_device_page(page)) {
> > >  			err = page_to_nid(page);
> > > -			put_page(page);
> > > +			if (foll_flags & FOLL_GET)
> > > +				put_page(page);
> > >  		} else {
> > >  			err = -ENOENT;
> > >  		}
> > >
> > 
> > 
> > 
> 
> 





^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  5:16         ` Alistair Popple
@ 2022-08-15  5:20           ` Wang, Haiyue
  2022-08-15  5:35             ` Alistair Popple
  0 siblings, 1 reply; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-15  5:20 UTC (permalink / raw)
  To: Alistair Popple, linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, Huang, Ying, songmuchun, naoya.horiguchi,
	alex.sierra

> -----Original Message-----
> From: Alistair Popple <apopple@nvidia.com>
> Sent: Monday, August 15, 2022 13:17
> To: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Wang, Haiyue <haiyue.wang@intel.com>
> Cc: akpm@linux-foundation.org; david@redhat.com; linmiaohe@huawei.com; Huang, Ying
> <ying.huang@intel.com>; songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com
> Subject: Re: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page
> 
> On Monday, 15 August 2022 2:40:48 PM AEST Wang, Haiyue wrote:
> > > -----Original Message-----
> > > From: Alistair Popple <apopple@nvidia.com>
> > > Sent: Monday, August 15, 2022 12:29
> > > To: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Wang, Haiyue
> <haiyue.wang@intel.com>
> > > Cc: akpm@linux-foundation.org; david@redhat.com; linmiaohe@huawei.com;
> Huang, Ying
> > > <ying.huang@intel.com>; songmuchun@bytedance.com;
> naoya.horiguchi@linux.dev; alex.sierra@amd.com; Wang,
> > > Haiyue <haiyue.wang@intel.com>
> > > Subject: Re: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on
> following huge page
> > >
> > > On Monday, 15 August 2022 11:59:08 AM AEST Haiyue Wang wrote:
> > > > Not all huge page APIs support FOLL_GET option, so the __NR_move_pages
> > > > will fail to get the page node information for huge page.
> > >
> > > I think you should be explicit in the commit message about which functions
> do
> > > not support FOLL_GET as it's not obvious what support needs to be added
> before
> > > this fix can be reverted.
> >
> > Yes, makes sense, will add them in a new patch.
> 
> Actually while you're at it I think it would be good to include a description
> of the impact of this failure in the commit message. Ie. Your answer to:
> 
> > What are the user-visible runtime effects of this bug?
> 
> As it documents what should be tested if this fix does actually ever get
> reverted.

A short example *.c program to capture the bug in the commit message?
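
Something like this untested sketch, assuming a reserved 1GB hugetlb
page on the system (compile with -lnuma):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <numaif.h>			/* move_pages() */

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << 26)	/* 30 << MAP_HUGE_SHIFT */
#endif

int main(void)
{
	size_t len = 1UL << 30;		/* one 1GB huge page */
	void *pages[1];
	int status[1] = { 0 };
	void *addr;

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		    -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");		/* needs a reserved 1GB hugetlb page */
		return 1;
	}
	memset(addr, 0, 4096);		/* fault the huge page in */

	pages[0] = addr;
	/* nodes == NULL: query-only mode, goes through do_pages_stat_array() */
	if (move_pages(0, 1, pages, NULL, status, 0) < 0) {
		perror("move_pages");
		return 1;
	}
	printf("status[0] = %d\n", status[0]);	/* -2 (-ENOENT) on affected kernels */
	return 0;
}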

> 
> > >
> > > Thanks.
> > >
> > >  - Alistair
> > >
> > > > This is an temporary solution to mitigate the racing fix.
> > > >
> > > > After supporting follow huge page by FOLL_GET is done, this fix can be
> > > > reverted safely.
> > > >
> > > > Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array
> racing
> > > with memory offline")
> > > > Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> > > > ---
> > > >  mm/migrate.c | 10 ++++++++--
> > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > index 6a1597c92261..581dfaad9257 100644
> > > > --- a/mm/migrate.c
> > > > +++ b/mm/migrate.c
> > > > @@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct
> *mm,
> > > unsigned long nr_pages,
> > > >
> > > >  	for (i = 0; i < nr_pages; i++) {
> > > >  		unsigned long addr = (unsigned long)(*pages);
> > > > +		unsigned int foll_flags = FOLL_DUMP;
> > > >  		struct vm_area_struct *vma;
> > > >  		struct page *page;
> > > >  		int err = -EFAULT;
> > > > @@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct
> *mm,
> > > unsigned long nr_pages,
> > > >  		if (!vma)
> > > >  			goto set_status;
> > > >
> > > > +		/* Not all huge page follow APIs support 'FOLL_GET' */
> > > > +		if (!is_vm_hugetlb_page(vma))
> > > > +			foll_flags |= FOLL_GET;
> > > > +
> > > >  		/* FOLL_DUMP to ignore special (like zero) pages */
> > > > -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> > > > +		page = follow_page(vma, addr, foll_flags);
> > > >
> > > >  		err = PTR_ERR(page);
> > > >  		if (IS_ERR(page))
> > > > @@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct
> *mm,
> > > unsigned long nr_pages,
> > > >
> > > >  		if (page && !is_zone_device_page(page)) {
> > > >  			err = page_to_nid(page);
> > > > -			put_page(page);
> > > > +			if (foll_flags & FOLL_GET)
> > > > +				put_page(page);
> > > >  		} else {
> > > >  			err = -ENOENT;
> > > >  		}
> > > >
> > >
> > >
> > >
> >
> >
> 
> 
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  5:20           ` Wang, Haiyue
@ 2022-08-15  5:35             ` Alistair Popple
  2022-08-15  5:37               ` Wang, Haiyue
  0 siblings, 1 reply; 62+ messages in thread
From: Alistair Popple @ 2022-08-15  5:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Wang, Haiyue
  Cc: akpm, david, linmiaohe, Huang, Ying, songmuchun, naoya.horiguchi,
	alex.sierra

On Monday, 15 August 2022 3:20:28 PM AEST Wang, Haiyue wrote:
> > -----Original Message-----
> > From: Alistair Popple <apopple@nvidia.com>
> > Sent: Monday, August 15, 2022 13:17
> > To: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Wang, Haiyue 
<haiyue.wang@intel.com>
> > Cc: akpm@linux-foundation.org; david@redhat.com; linmiaohe@huawei.com; 
Huang, Ying
> > <ying.huang@intel.com>; songmuchun@bytedance.com; 
naoya.horiguchi@linux.dev; alex.sierra@amd.com
> > Subject: Re: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on 
following huge page
> > 
> > On Monday, 15 August 2022 2:40:48 PM AEST Wang, Haiyue wrote:
> > > > -----Original Message-----
> > > > From: Alistair Popple <apopple@nvidia.com>
> > > > Sent: Monday, August 15, 2022 12:29
> > > > To: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Wang, Haiyue
> > <haiyue.wang@intel.com>
> > > > Cc: akpm@linux-foundation.org; david@redhat.com; linmiaohe@huawei.com;
> > Huang, Ying
> > > > <ying.huang@intel.com>; songmuchun@bytedance.com;
> > naoya.horiguchi@linux.dev; alex.sierra@amd.com; Wang,
> > > > Haiyue <haiyue.wang@intel.com>
> > > > Subject: Re: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on
> > following huge page
> > > >
> > > > On Monday, 15 August 2022 11:59:08 AM AEST Haiyue Wang wrote:
> > > > > Not all huge page APIs support FOLL_GET option, so the 
__NR_move_pages
> > > > > will fail to get the page node information for huge page.
> > > >
> > > > I think you should be explicit in the commit message about which 
functions
> > do
> > > > not support FOLL_GET as it's not obvious what support needs to be 
added
> > before
> > > > this fix can be reverted.
> > >
> > > Yes, makes sense, will add them in a new patch.
> > 
> > Actually while you're at it I think it would be good to include a 
description
> > of the impact of this failure in the commit message. Ie. Your answer to:
> > 
> > > What are the user-visible runtime effects of this bug?
> > 
> > As it documents what should be tested if this fix does actually ever get
> > reverted.
> 
> A short example *.c program to capture the bug in the commit message?

That's probably overkill. Just being a bit more explicit about the 
circumstances in which sys_move_pages() actually fails would be good. Eg. 
something like this:

"Without this sys_move_pages() will return -ENOENT for 1GB huge page
memory map when dumping the page node information for nodes != NULL"

> > 
> > > >
> > > > Thanks.
> > > >
> > > >  - Alistair
> > > >
> > > > > This is an temporary solution to mitigate the racing fix.
> > > > >
> > > > > After supporting follow huge page by FOLL_GET is done, this fix can 
be
> > > > > reverted safely.
> > > > >
> > > > > Fixes: 4cd614841c06 ("mm: migration: fix possible 
do_pages_stat_array
> > racing
> > > > with memory offline")
> > > > > Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> > > > > ---
> > > > >  mm/migrate.c | 10 ++++++++--
> > > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > > index 6a1597c92261..581dfaad9257 100644
> > > > > --- a/mm/migrate.c
> > > > > +++ b/mm/migrate.c
> > > > > @@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct 
mm_struct
> > *mm,
> > > > unsigned long nr_pages,
> > > > >
> > > > >  	for (i = 0; i < nr_pages; i++) {
> > > > >  		unsigned long addr = (unsigned long)(*pages);
> > > > > +		unsigned int foll_flags = FOLL_DUMP;
> > > > >  		struct vm_area_struct *vma;
> > > > >  		struct page *page;
> > > > >  		int err = -EFAULT;
> > > > > @@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct 
mm_struct
> > *mm,
> > > > unsigned long nr_pages,
> > > > >  		if (!vma)
> > > > >  			goto set_status;
> > > > >
> > > > > +		/* Not all huge page follow APIs support 'FOLL_GET' 
*/
> > > > > +		if (!is_vm_hugetlb_page(vma))
> > > > > +			foll_flags |= FOLL_GET;
> > > > > +
> > > > >  		/* FOLL_DUMP to ignore special (like zero) pages */
> > > > > -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> > > > > +		page = follow_page(vma, addr, foll_flags);
> > > > >
> > > > >  		err = PTR_ERR(page);
> > > > >  		if (IS_ERR(page))
> > > > > @@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct 
mm_struct
> > *mm,
> > > > unsigned long nr_pages,
> > > > >
> > > > >  		if (page && !is_zone_device_page(page)) {
> > > > >  			err = page_to_nid(page);
> > > > > -			put_page(page);
> > > > > +			if (foll_flags & FOLL_GET)
> > > > > +				put_page(page);
> > > > >  		} else {
> > > > >  			err = -ENOENT;
> > > > >  		}
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > 
> > 
> > 
> 
> 





^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  5:35             ` Alistair Popple
@ 2022-08-15  5:37               ` Wang, Haiyue
  0 siblings, 0 replies; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-15  5:37 UTC (permalink / raw)
  To: Alistair Popple, linux-mm, linux-kernel
  Cc: akpm, david, linmiaohe, Huang, Ying, songmuchun, naoya.horiguchi,
	alex.sierra

> -----Original Message-----
> From: Alistair Popple <apopple@nvidia.com>
> Sent: Monday, August 15, 2022 13:35
> To: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Wang, Haiyue <haiyue.wang@intel.com>
> Cc: akpm@linux-foundation.org; david@redhat.com; linmiaohe@huawei.com; Huang, Ying
> <ying.huang@intel.com>; songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com
> Subject: Re: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page
> 
> On Monday, 15 August 2022 3:20:28 PM AEST Wang, Haiyue wrote:
> > > -----Original Message-----
> > > From: Alistair Popple <apopple@nvidia.com>
> > > Sent: Monday, August 15, 2022 13:17
> > > To: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Wang, Haiyue
> <haiyue.wang@intel.com>
> > > Cc: akpm@linux-foundation.org; david@redhat.com; linmiaohe@huawei.com;
> Huang, Ying
> > > <ying.huang@intel.com>; songmuchun@bytedance.com;
> naoya.horiguchi@linux.dev; alex.sierra@amd.com
> > > Subject: Re: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on
> following huge page
> > >
> > > On Monday, 15 August 2022 2:40:48 PM AEST Wang, Haiyue wrote:
> > > > > -----Original Message-----
> > > > > From: Alistair Popple <apopple@nvidia.com>
> > > > > Sent: Monday, August 15, 2022 12:29
> > > > > To: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Wang, Haiyue
> > > <haiyue.wang@intel.com>
> > > > > Cc: akpm@linux-foundation.org; david@redhat.com; linmiaohe@huawei.com;
> > > Huang, Ying
> > > > > <ying.huang@intel.com>; songmuchun@bytedance.com;
> > > naoya.horiguchi@linux.dev; alex.sierra@amd.com; Wang,
> > > > > Haiyue <haiyue.wang@intel.com>
> > > > > Subject: Re: [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on
> > > following huge page
> > > > >
> > > > > On Monday, 15 August 2022 11:59:08 AM AEST Haiyue Wang wrote:
> > > > > > Not all huge page APIs support FOLL_GET option, so the
> __NR_move_pages
> > > > > > will fail to get the page node information for huge page.
> > > > >
> > > > > I think you should be explicit in the commit message about which
> functions
> > > do
> > > > > not support FOLL_GET as it's not obvious what support needs to be
> added
> > > before
> > > > > this fix can be reverted.
> > > >
> > > > Yes, makes sense, will add them in a new patch.
> > >
> > > Actually while you're at it I think it would be good to include a
> description
> > > of the impact of this failure in the commit message. Ie. Your answer to:
> > >
> > > > What are the user-visible runtime effects of this bug?
> > >
> > > As it documents what should be tested if this fix does actually ever get
> > > reverted.
> >
> > A short example *.c program to capture the bug in the commit message?
> 
> That's probably overkill. Just being a bit more explicit about the
> circumstances in which sys_move_pages() actually fails would be good. Eg.
> something like this:
> 
> "Without this sys_move_pages() will return -ENOENT for 1GB huge page
> memory map when dumping the page node information for nodes != NULL"

OK, understood, will try to add more information about the error. ;-)
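
For reference, the Linux 5.19 x86 path that refuses FOLL_GET looks
roughly like this (paraphrased from mm/hugetlb.c, for illustration
only, not part of this series):

struct page * __weak
follow_huge_pud(struct mm_struct *mm, unsigned long address,
		pud_t *pud, int flags)
{
	if (flags & (FOLL_GET | FOLL_PIN))
		return NULL;	/* no refcounting support for 1GB pages */

	return pte_page(*(pte_t *)pud) +
	       ((address & ~PUD_MASK) >> PAGE_SHIFT);
}

So follow_page() returns NULL, and do_pages_stat_array() reports
-ENOENT for the page.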

> 
> > >
> > > > >
> > > > > Thanks.
> > > > >
> > > > >  - Alistair
> > > > >
> > > > > > This is an temporary solution to mitigate the racing fix.
> > > > > >
> > > > > > After supporting follow huge page by FOLL_GET is done, this fix can
> be
> > > > > > reverted safely.
> > > > > >
> > > > > > Fixes: 4cd614841c06 ("mm: migration: fix possible
> do_pages_stat_array
> > > racing
> > > > > with memory offline")
> > > > > > Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> > > > > > ---
> > > > > >  mm/migrate.c | 10 ++++++++--
> > > > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > > > index 6a1597c92261..581dfaad9257 100644
> > > > > > --- a/mm/migrate.c
> > > > > > +++ b/mm/migrate.c
> > > > > > @@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct
> mm_struct
> > > *mm,
> > > > > unsigned long nr_pages,
> > > > > >
> > > > > >  	for (i = 0; i < nr_pages; i++) {
> > > > > >  		unsigned long addr = (unsigned long)(*pages);
> > > > > > +		unsigned int foll_flags = FOLL_DUMP;
> > > > > >  		struct vm_area_struct *vma;
> > > > > >  		struct page *page;
> > > > > >  		int err = -EFAULT;
> > > > > > @@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct
> mm_struct
> > > *mm,
> > > > > unsigned long nr_pages,
> > > > > >  		if (!vma)
> > > > > >  			goto set_status;
> > > > > >
> > > > > > +		/* Not all huge page follow APIs support 'FOLL_GET'
> */
> > > > > > +		if (!is_vm_hugetlb_page(vma))
> > > > > > +			foll_flags |= FOLL_GET;
> > > > > > +
> > > > > >  		/* FOLL_DUMP to ignore special (like zero) pages */
> > > > > > -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> > > > > > +		page = follow_page(vma, addr, foll_flags);
> > > > > >
> > > > > >  		err = PTR_ERR(page);
> > > > > >  		if (IS_ERR(page))
> > > > > > @@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct
> mm_struct
> > > *mm,
> > > > > unsigned long nr_pages,
> > > > > >
> > > > > >  		if (page && !is_zone_device_page(page)) {
> > > > > >  			err = page_to_nid(page);
> > > > > > -			put_page(page);
> > > > > > +			if (foll_flags & FOLL_GET)
> > > > > > +				put_page(page);
> > > > > >  		} else {
> > > > > >  			err = -ENOENT;
> > > > > >  		}
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> 
> 
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v5 0/2] fix follow_page related issues
  2022-08-12  8:49 [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
                   ` (3 preceding siblings ...)
  2022-08-15  1:59 ` [PATCH v4 0/2] fix follow_page related issues Haiyue Wang
@ 2022-08-15  7:02 ` Haiyue Wang
  2022-08-15  7:02   ` [PATCH v5 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
  2022-08-15  7:02   ` [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
  2022-08-16  2:20 ` [PATCH v6 0/2] fix follow_page related issues Haiyue Wang
  5 siblings, 2 replies; 62+ messages in thread
From: Haiyue Wang @ 2022-08-15  7:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, apopple, linmiaohe, ying.huang, songmuchun,
	naoya.horiguchi, alex.sierra, Haiyue Wang

v5: reword the commit message for FOLL_GET with more information.

v4: add '()' to the function names for readability.
    add more words about the Non-LRU pages fix in the commit message.

v3: Merge the fix for handling Non-LRU pages into one patch.
    Drop the break_ksm() zone device page check.

v2: Add the Non-LRU pages fix as two patches, so that
    'mm: migration: fix the FOLL_GET' can be applied directly
    on the linux-5.19 stable branch.

Haiyue Wang (2):
  mm: migration: fix the FOLL_GET failure on following huge page
  mm: fix the handling Non-LRU pages returned by follow_page

 mm/huge_memory.c |  4 ++--
 mm/ksm.c         | 12 +++++++++---
 mm/migrate.c     | 20 +++++++++++++++-----
 3 files changed, 26 insertions(+), 10 deletions(-)

-- 
2.37.2


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v5 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  7:02 ` [PATCH v5 0/2] fix follow_page related issues Haiyue Wang
@ 2022-08-15  7:02   ` Haiyue Wang
  2022-08-15  7:40     ` Huang, Ying
  2022-08-15  7:02   ` [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
  1 sibling, 1 reply; 62+ messages in thread
From: Haiyue Wang @ 2022-08-15  7:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, apopple, linmiaohe, ying.huang, songmuchun,
	naoya.horiguchi, alex.sierra, Haiyue Wang

Not all huge page APIs support the FOLL_GET option, so the move_pages()
syscall will fail to get the page node information for some huge pages.

For example, on x86 with Linux 5.19, the 1GB huge page API
follow_huge_pud() returns a NULL page for FOLL_GET, so calling the
move_pages() syscall with a NULL 'nodes' parameter leaves a '-2' error
in the 'status' array.

Note: follow_huge_pud() now supports FOLL_GET in Linux 6.0.
      Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev

But these huge page APIs still don't support FOLL_GET:
  1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
  2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
     It will trigger a WARN_ON_ONCE for FOLL_GET.
  3. follow_huge_pgd() in mm/hugetlb.c

This is a temporary solution to mitigate the side effect of the race
condition fix by calling follow_page() with FOLL_GET set for huge pages.

After support for following huge pages with FOLL_GET is done, this fix
can be reverted safely.

Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
---
 mm/migrate.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..581dfaad9257 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 	for (i = 0; i < nr_pages; i++) {
 		unsigned long addr = (unsigned long)(*pages);
+		unsigned int foll_flags = FOLL_DUMP;
 		struct vm_area_struct *vma;
 		struct page *page;
 		int err = -EFAULT;
@@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (!vma)
 			goto set_status;
 
+		/* Not all huge page follow APIs support 'FOLL_GET' */
+		if (!is_vm_hugetlb_page(vma))
+			foll_flags |= FOLL_GET;
+
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+		page = follow_page(vma, addr, foll_flags);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 		if (page && !is_zone_device_page(page)) {
 			err = page_to_nid(page);
-			put_page(page);
+			if (foll_flags & FOLL_GET)
+				put_page(page);
 		} else {
 			err = -ENOENT;
 		}
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-15  7:02 ` [PATCH v5 0/2] fix follow_page related issues Haiyue Wang
  2022-08-15  7:02   ` [PATCH v5 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
@ 2022-08-15  7:02   ` Haiyue Wang
  2022-08-15  7:50     ` Huang, Ying
                       ` (2 more replies)
  1 sibling, 3 replies; 62+ messages in thread
From: Haiyue Wang @ 2022-08-15  7:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, apopple, linmiaohe, ying.huang, songmuchun,
	naoya.horiguchi, alex.sierra, Haiyue Wang, Felix Kuehling

The handling of non-LRU pages returned by follow_page() jumps away
directly without calling put_page() to drop the reference count, even
though the 'FOLL_GET' flag passed to follow_page() means get_page() has
been called. Fix the zone device page checks by handling the page
reference count correctly before returning.

And as David reviewed, "device pages are never PageKsm pages". Drop this
zone device page check for break_ksm().

Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
---
 mm/huge_memory.c |  4 ++--
 mm/ksm.c         | 12 +++++++++---
 mm/migrate.c     | 10 +++++++---
 3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a7c1b344abe..b2ba17c3dcd7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		/* FOLL_DUMP to ignore special (like zero) pages */
 		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
 
-		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+		if (IS_ERR_OR_NULL(page))
 			continue;
 
-		if (!is_transparent_hugepage(page))
+		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
 			goto next;
 
 		total++;
diff --git a/mm/ksm.c b/mm/ksm.c
index 42ab153335a2..e26f57fc1f0e 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -475,7 +475,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 		cond_resched();
 		page = follow_page(vma, addr,
 				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
-		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
 			ret = handle_mm_fault(vma, addr,
@@ -560,12 +560,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 		goto out;
 
 	page = follow_page(vma, addr, FOLL_GET);
-	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+	if (IS_ERR_OR_NULL(page))
 		goto out;
+	if (is_zone_device_page(page))
+		goto out_putpage;
 	if (PageAnon(page)) {
 		flush_anon_page(vma, page, addr);
 		flush_dcache_page(page);
 	} else {
+out_putpage:
 		put_page(page);
 out:
 		page = NULL;
@@ -2308,11 +2311,13 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 			if (ksm_test_exit(mm))
 				break;
 			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
-			if (IS_ERR_OR_NULL(*page) || is_zone_device_page(*page)) {
+			if (IS_ERR_OR_NULL(*page)) {
 				ksm_scan.address += PAGE_SIZE;
 				cond_resched();
 				continue;
 			}
+			if (is_zone_device_page(*page))
+				goto next_page;
 			if (PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
 				flush_dcache_page(*page);
@@ -2327,6 +2332,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 				mmap_read_unlock(mm);
 				return rmap_item;
 			}
+next_page:
 			put_page(*page);
 			ksm_scan.address += PAGE_SIZE;
 			cond_resched();
diff --git a/mm/migrate.c b/mm/migrate.c
index 581dfaad9257..fee12cd2f294 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1672,9 +1672,12 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 		goto out;
 
 	err = -ENOENT;
-	if (!page || is_zone_device_page(page))
+	if (!page)
 		goto out;
 
+	if (is_zone_device_page(page))
+		goto out_putpage;
+
 	err = 0;
 	if (page_to_nid(page) == node)
 		goto out_putpage;
@@ -1868,8 +1871,9 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (IS_ERR(page))
 			goto set_status;
 
-		if (page && !is_zone_device_page(page)) {
-			err = page_to_nid(page);
+		if (page) {
+			err = !is_zone_device_page(page) ? page_to_nid(page)
+							 : -ENOENT;
 			if (foll_flags & FOLL_GET)
 				put_page(page);
 		} else {
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-15  7:02   ` [PATCH v5 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
@ 2022-08-15  7:40     ` Huang, Ying
  0 siblings, 0 replies; 62+ messages in thread
From: Huang, Ying @ 2022-08-15  7:40 UTC (permalink / raw)
  To: Haiyue Wang
  Cc: linux-mm, linux-kernel, akpm, david, apopple, linmiaohe,
	songmuchun, naoya.horiguchi, alex.sierra

Haiyue Wang <haiyue.wang@intel.com> writes:

> Not all huge page APIs support the FOLL_GET option, so the move_pages()
> syscall will fail to get the page node information for some huge pages.
>
> For example, on x86 with Linux 5.19, the 1GB huge page API
> follow_huge_pud() returns a NULL page for FOLL_GET, so calling the
> move_pages() syscall with a NULL 'nodes' parameter leaves a '-2' error
> in the 'status' array.
>
> Note: follow_huge_pud() now supports FOLL_GET in Linux 6.0.
>       Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev
>
> But these huge page APIs still don't support FOLL_GET:
>   1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
>   2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
>      It will trigger a WARN_ON_ONCE for FOLL_GET.
>   3. follow_huge_pgd() in mm/hugetlb.c
>
> This is a temporary solution to mitigate the side effect of the race
> condition fix by calling follow_page() with FOLL_GET set for huge pages.
>
> After support for following huge pages with FOLL_GET is done, this fix
> can be reverted safely.
>
> Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>

LGTM, Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> ---
>  mm/migrate.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6a1597c92261..581dfaad9257 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  
>  	for (i = 0; i < nr_pages; i++) {
>  		unsigned long addr = (unsigned long)(*pages);
> +		unsigned int foll_flags = FOLL_DUMP;
>  		struct vm_area_struct *vma;
>  		struct page *page;
>  		int err = -EFAULT;
> @@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  		if (!vma)
>  			goto set_status;
>  
> +		/* Not all huge page follow APIs support 'FOLL_GET' */
> +		if (!is_vm_hugetlb_page(vma))
> +			foll_flags |= FOLL_GET;
> +
>  		/* FOLL_DUMP to ignore special (like zero) pages */
> -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> +		page = follow_page(vma, addr, foll_flags);
>  
>  		err = PTR_ERR(page);
>  		if (IS_ERR(page))
> @@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  
>  		if (page && !is_zone_device_page(page)) {
>  			err = page_to_nid(page);
> -			put_page(page);
> +			if (foll_flags & FOLL_GET)
> +				put_page(page);
>  		} else {
>  			err = -ENOENT;
>  		}

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-15  7:02   ` [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
@ 2022-08-15  7:50     ` Huang, Ying
  2022-08-15 14:28     ` Felix Kuehling
  2022-08-16  0:00     ` Alistair Popple
  2 siblings, 0 replies; 62+ messages in thread
From: Huang, Ying @ 2022-08-15  7:50 UTC (permalink / raw)
  To: Haiyue Wang
  Cc: linux-mm, linux-kernel, akpm, david, apopple, linmiaohe,
	songmuchun, naoya.horiguchi, alex.sierra, Felix Kuehling

Haiyue Wang <haiyue.wang@intel.com> writes:

> The handling of non-LRU pages returned by follow_page() jumps away
> directly without calling put_page() to drop the reference count, even
> though the 'FOLL_GET' flag passed to follow_page() means get_page() has
> been called. Fix the zone device page checks by handling the page
> reference count correctly before returning.
>
> And as David reviewed, "device pages are never PageKsm pages". Drop this
> zone device page check for break_ksm().
>
> Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>

LGTM, Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> ---
>  mm/huge_memory.c |  4 ++--
>  mm/ksm.c         | 12 +++++++++---
>  mm/migrate.c     | 10 +++++++---
>  3 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8a7c1b344abe..b2ba17c3dcd7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		/* FOLL_DUMP to ignore special (like zero) pages */
>  		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
>  
> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +		if (IS_ERR_OR_NULL(page))
>  			continue;
>  
> -		if (!is_transparent_hugepage(page))
> +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
>  			goto next;
>  
>  		total++;
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 42ab153335a2..e26f57fc1f0e 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -475,7 +475,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
>  		cond_resched();
>  		page = follow_page(vma, addr,
>  				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +		if (IS_ERR_OR_NULL(page))
>  			break;
>  		if (PageKsm(page))
>  			ret = handle_mm_fault(vma, addr,
> @@ -560,12 +560,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
>  		goto out;
>  
>  	page = follow_page(vma, addr, FOLL_GET);
> -	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +	if (IS_ERR_OR_NULL(page))
>  		goto out;
> +	if (is_zone_device_page(page))
> +		goto out_putpage;
>  	if (PageAnon(page)) {
>  		flush_anon_page(vma, page, addr);
>  		flush_dcache_page(page);
>  	} else {
> +out_putpage:
>  		put_page(page);
>  out:
>  		page = NULL;
> @@ -2308,11 +2311,13 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  			if (ksm_test_exit(mm))
>  				break;
>  			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
> -			if (IS_ERR_OR_NULL(*page) || is_zone_device_page(*page)) {
> +			if (IS_ERR_OR_NULL(*page)) {
>  				ksm_scan.address += PAGE_SIZE;
>  				cond_resched();
>  				continue;
>  			}
> +			if (is_zone_device_page(*page))
> +				goto next_page;
>  			if (PageAnon(*page)) {
>  				flush_anon_page(vma, *page, ksm_scan.address);
>  				flush_dcache_page(*page);
> @@ -2327,6 +2332,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  				mmap_read_unlock(mm);
>  				return rmap_item;
>  			}
> +next_page:
>  			put_page(*page);
>  			ksm_scan.address += PAGE_SIZE;
>  			cond_resched();
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 581dfaad9257..fee12cd2f294 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1672,9 +1672,12 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>  		goto out;
>  
>  	err = -ENOENT;
> -	if (!page || is_zone_device_page(page))
> +	if (!page)
>  		goto out;
>  
> +	if (is_zone_device_page(page))
> +		goto out_putpage;
> +
>  	err = 0;
>  	if (page_to_nid(page) == node)
>  		goto out_putpage;
> @@ -1868,8 +1871,9 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  		if (IS_ERR(page))
>  			goto set_status;
>  
> -		if (page && !is_zone_device_page(page)) {
> -			err = page_to_nid(page);
> +		if (page) {
> +			err = !is_zone_device_page(page) ? page_to_nid(page)
> +							 : -ENOENT;
>  			if (foll_flags & FOLL_GET)
>  				put_page(page);
>  		} else {

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-15  7:02   ` [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
  2022-08-15  7:50     ` Huang, Ying
@ 2022-08-15 14:28     ` Felix Kuehling
  2022-08-16  0:00     ` Alistair Popple
  2 siblings, 0 replies; 62+ messages in thread
From: Felix Kuehling @ 2022-08-15 14:28 UTC (permalink / raw)
  To: Haiyue Wang, linux-mm, linux-kernel
  Cc: akpm, david, apopple, linmiaohe, ying.huang, songmuchun,
	naoya.horiguchi, alex.sierra

On 2022-08-15 at 03:02, Haiyue Wang wrote:
> The handling of non-LRU pages returned by follow_page() jumps away
> directly without calling put_page() to drop the reference count, even
> though the 'FOLL_GET' flag passed to follow_page() means get_page() has
> been called. Fix the zone device page checks by handling the page
> reference count correctly before returning.
>
> And as David reviewed, "device pages are never PageKsm pages". Drop this
> zone device page check for break_ksm().
>
> Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>

Thank you for catching this. The patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


> ---
>   mm/huge_memory.c |  4 ++--
>   mm/ksm.c         | 12 +++++++++---
>   mm/migrate.c     | 10 +++++++---
>   3 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8a7c1b344abe..b2ba17c3dcd7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>   		/* FOLL_DUMP to ignore special (like zero) pages */
>   		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
>   
> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +		if (IS_ERR_OR_NULL(page))
>   			continue;
>   
> -		if (!is_transparent_hugepage(page))
> +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
>   			goto next;
>   
>   		total++;
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 42ab153335a2..e26f57fc1f0e 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -475,7 +475,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
>   		cond_resched();
>   		page = follow_page(vma, addr,
>   				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +		if (IS_ERR_OR_NULL(page))
>   			break;
>   		if (PageKsm(page))
>   			ret = handle_mm_fault(vma, addr,
> @@ -560,12 +560,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
>   		goto out;
>   
>   	page = follow_page(vma, addr, FOLL_GET);
> -	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +	if (IS_ERR_OR_NULL(page))
>   		goto out;
> +	if (is_zone_device_page(page))
> +		goto out_putpage;
>   	if (PageAnon(page)) {
>   		flush_anon_page(vma, page, addr);
>   		flush_dcache_page(page);
>   	} else {
> +out_putpage:
>   		put_page(page);
>   out:
>   		page = NULL;
> @@ -2308,11 +2311,13 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>   			if (ksm_test_exit(mm))
>   				break;
>   			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
> -			if (IS_ERR_OR_NULL(*page) || is_zone_device_page(*page)) {
> +			if (IS_ERR_OR_NULL(*page)) {
>   				ksm_scan.address += PAGE_SIZE;
>   				cond_resched();
>   				continue;
>   			}
> +			if (is_zone_device_page(*page))
> +				goto next_page;
>   			if (PageAnon(*page)) {
>   				flush_anon_page(vma, *page, ksm_scan.address);
>   				flush_dcache_page(*page);
> @@ -2327,6 +2332,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>   				mmap_read_unlock(mm);
>   				return rmap_item;
>   			}
> +next_page:
>   			put_page(*page);
>   			ksm_scan.address += PAGE_SIZE;
>   			cond_resched();
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 581dfaad9257..fee12cd2f294 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1672,9 +1672,12 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>   		goto out;
>   
>   	err = -ENOENT;
> -	if (!page || is_zone_device_page(page))
> +	if (!page)
>   		goto out;
>   
> +	if (is_zone_device_page(page))
> +		goto out_putpage;
> +
>   	err = 0;
>   	if (page_to_nid(page) == node)
>   		goto out_putpage;
> @@ -1868,8 +1871,9 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>   		if (IS_ERR(page))
>   			goto set_status;
>   
> -		if (page && !is_zone_device_page(page)) {
> -			err = page_to_nid(page);
> +		if (page) {
> +			err = !is_zone_device_page(page) ? page_to_nid(page)
> +							 : -ENOENT;
>   			if (foll_flags & FOLL_GET)
>   				put_page(page);
>   		} else {

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-15  7:02   ` [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
  2022-08-15  7:50     ` Huang, Ying
  2022-08-15 14:28     ` Felix Kuehling
@ 2022-08-16  0:00     ` Alistair Popple
  2022-08-16  1:12       ` Wang, Haiyue
  2 siblings, 1 reply; 62+ messages in thread
From: Alistair Popple @ 2022-08-16  0:00 UTC (permalink / raw)
  To: Haiyue Wang
  Cc: linux-mm, linux-kernel, akpm, david, linmiaohe, ying.huang,
	songmuchun, naoya.horiguchi, alex.sierra, Felix Kuehling


Haiyue Wang <haiyue.wang@intel.com> writes:

> The handling of non-LRU pages returned by follow_page() jumps away
> directly without calling put_page() to drop the reference count, even
> though the 'FOLL_GET' flag passed to follow_page() means get_page() has
> been called. Fix the zone device page checks by handling the page
> reference count correctly before returning.
>
> And as David reviewed, "device pages are never PageKsm pages". Drop this
> zone device page check for break_ksm().
>
> Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> ---
>  mm/huge_memory.c |  4 ++--
>  mm/ksm.c         | 12 +++++++++---
>  mm/migrate.c     | 10 +++++++---
>  3 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8a7c1b344abe..b2ba17c3dcd7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		/* FOLL_DUMP to ignore special (like zero) pages */
>  		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
>
> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +		if (IS_ERR_OR_NULL(page))
>  			continue;
>
> -		if (!is_transparent_hugepage(page))
> +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
>  			goto next;
>
>  		total++;
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 42ab153335a2..e26f57fc1f0e 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -475,7 +475,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
>  		cond_resched();
>  		page = follow_page(vma, addr,
>  				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +		if (IS_ERR_OR_NULL(page))
>  			break;
>  		if (PageKsm(page))
>  			ret = handle_mm_fault(vma, addr,
> @@ -560,12 +560,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
>  		goto out;
>
>  	page = follow_page(vma, addr, FOLL_GET);
> -	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +	if (IS_ERR_OR_NULL(page))
>  		goto out;
> +	if (is_zone_device_page(page))

Same as for break_ksm() I think we should be able to drop the
is_zone_device_page() check here because scan_get_next_rmap_item()
already filters out zone device pages.

> +		goto out_putpage;
>  	if (PageAnon(page)) {
>  		flush_anon_page(vma, page, addr);
>  		flush_dcache_page(page);
>  	} else {
> +out_putpage:
>  		put_page(page);
>  out:
>  		page = NULL;
> @@ -2308,11 +2311,13 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  			if (ksm_test_exit(mm))
>  				break;
>  			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
> -			if (IS_ERR_OR_NULL(*page) || is_zone_device_page(*page)) {
> +			if (IS_ERR_OR_NULL(*page)) {
>  				ksm_scan.address += PAGE_SIZE;
>  				cond_resched();
>  				continue;
>  			}
> +			if (is_zone_device_page(*page))
> +				goto next_page;
>  			if (PageAnon(*page)) {
>  				flush_anon_page(vma, *page, ksm_scan.address);
>  				flush_dcache_page(*page);
> @@ -2327,6 +2332,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  				mmap_read_unlock(mm);
>  				return rmap_item;
>  			}
> +next_page:
>  			put_page(*page);
>  			ksm_scan.address += PAGE_SIZE;
>  			cond_resched();
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 581dfaad9257..fee12cd2f294 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1672,9 +1672,12 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>  		goto out;
>
>  	err = -ENOENT;
> -	if (!page || is_zone_device_page(page))
> +	if (!page)
>  		goto out;
>
> +	if (is_zone_device_page(page))
> +		goto out_putpage;
> +
>  	err = 0;
>  	if (page_to_nid(page) == node)
>  		goto out_putpage;
> @@ -1868,8 +1871,9 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  		if (IS_ERR(page))
>  			goto set_status;
>
> -		if (page && !is_zone_device_page(page)) {
> -			err = page_to_nid(page);
> +		if (page) {
> +			err = !is_zone_device_page(page) ? page_to_nid(page)
> +							 : -ENOENT;

Can we remove the multiple layers of conditionals here? Something like
this is cleaner and easier to understand IMHO:

-               if (page && !is_zone_device_page(page)) {
-                       err = page_to_nid(page);
-                       if (foll_flags & FOLL_GET)
-                               put_page(page);
-               } else {
+               if (!page) {
                        err = -ENOENT;
+                       goto set_status;
                }
+
+               if (is_zone_device_page(page))
+                       err = -ENOENT;
+               else
+                       err = page_to_nid(page);
+
+               if (foll_flags & FOLL_GET)
+                       put_page(page);
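
Or, equivalently, factored into a helper -- a hypothetical sketch, not
something in the tree -- so that every exit path after a successful
follow_page() stays balanced:

/* Hypothetical helper illustrating the invariant: a successful
 * follow_page(..., FOLL_GET) takes a reference, so every exit path
 * after that point must drop it with put_page().
 */
static int nid_of_page_at(struct vm_area_struct *vma, unsigned long addr)
{
	struct page *page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
	int err;

	if (IS_ERR(page))
		return PTR_ERR(page);	/* no reference was taken */
	if (!page)
		return -ENOENT;		/* ditto */

	err = is_zone_device_page(page) ? -ENOENT : page_to_nid(page);
	put_page(page);			/* balance the FOLL_GET reference */
	return err;
}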

Thanks.

 - Alistair

>  			if (foll_flags & FOLL_GET)
>  				put_page(page);
>  		} else {

^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-16  0:00     ` Alistair Popple
@ 2022-08-16  1:12       ` Wang, Haiyue
  2022-08-16  2:45         ` Alistair Popple
  0 siblings, 1 reply; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-16  1:12 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, linux-kernel, akpm, david, linmiaohe, Huang, Ying,
	songmuchun, naoya.horiguchi, alex.sierra, Felix Kuehling

> -----Original Message-----
> From: Alistair Popple <apopple@nvidia.com>
> Sent: Tuesday, August 16, 2022 08:01
> To: Wang, Haiyue <haiyue.wang@intel.com>
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; david@redhat.com;
> linmiaohe@huawei.com; Huang, Ying <ying.huang@intel.com>; songmuchun@bytedance.com;
> naoya.horiguchi@linux.dev; alex.sierra@amd.com; Felix Kuehling <Felix.Kuehling@amd.com>
> Subject: Re: [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page
> 
> 
> Haiyue Wang <haiyue.wang@intel.com> writes:
> 
> > The handling of non-LRU pages returned by follow_page() jumps away
> > directly without calling put_page() to drop the reference count, even
> > though the 'FOLL_GET' flag passed to follow_page() means get_page() has
> > been called. Fix the zone device page checks by handling the page
> > reference count correctly before returning.
> >
> > And as David reviewed, "device pages are never PageKsm pages". Drop this
> > zone device page check for break_ksm().
> >
> > Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> > Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> > ---
> >  mm/huge_memory.c |  4 ++--
> >  mm/ksm.c         | 12 +++++++++---
> >  mm/migrate.c     | 10 +++++++---
> >  3 files changed, 18 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 8a7c1b344abe..b2ba17c3dcd7 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
> >  		/* FOLL_DUMP to ignore special (like zero) pages */
> >  		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> >
> > -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> > +		if (IS_ERR_OR_NULL(page))
> >  			continue;
> >
> > -		if (!is_transparent_hugepage(page))
> > +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
> >  			goto next;
> >
> >  		total++;
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 42ab153335a2..e26f57fc1f0e 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -475,7 +475,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
> >  		cond_resched();
> >  		page = follow_page(vma, addr,
> >  				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
> > -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> > +		if (IS_ERR_OR_NULL(page))
> >  			break;
> >  		if (PageKsm(page))
> >  			ret = handle_mm_fault(vma, addr,
> > @@ -560,12 +560,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
> >  		goto out;
> >
> >  	page = follow_page(vma, addr, FOLL_GET);
> > -	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> > +	if (IS_ERR_OR_NULL(page))
> >  		goto out;
> > +	if (is_zone_device_page(page))
> 
> Same as for break_ksm() I think we should be able to drop the
> is_zone_device_page() check here because scan_get_next_rmap_item()
> already filters out zone device pages.
> 

The 'page' for scan_get_next_rmap_item() is from the scan loop's 'vma', which skips the NOT MERGEABLE ones:
	for (; vma; vma = vma->vm_next) {
		if (!(vma->vm_flags & VM_MERGEABLE))
			continue;

The 'page' for get_mergeable_page() is from a 'vma' that is MERGEABLE, found by 'find_mergeable_vma()'.

So the two pages may be different, and unstable_tree_search_insert() shows the logic:

 'page' vs 'tree_page':

		tree_page = get_mergeable_page(tree_rmap_item);
		if (!tree_page)
			return NULL;

		/*
		 * Don't substitute a ksm page for a forked page.
		 */
		if (page == tree_page) {
			put_page(tree_page);
			return NULL;
		}

		ret = memcmp_pages(page, tree_page);


> > +		goto out_putpage;
> >  	if (PageAnon(page)) {
> >  		flush_anon_page(vma, page, addr);
> >  		flush_dcache_page(page);
> >  	} else {
> > +out_putpage:
> >  		put_page(page);
> >  out:
> >  		page = NULL;
> > @@ -2308,11 +2311,13 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> >  			if (ksm_test_exit(mm))
> >  				break;
> >  			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
> > -			if (IS_ERR_OR_NULL(*page) || is_zone_device_page(*page)) {
> > +			if (IS_ERR_OR_NULL(*page)) {
> >  				ksm_scan.address += PAGE_SIZE;
> >  				cond_resched();
> >  				continue;
> >  			}
> > +			if (is_zone_device_page(*page))
> > +				goto next_page;
> >  			if (PageAnon(*page)) {
> >  				flush_anon_page(vma, *page, ksm_scan.address);
> >  				flush_dcache_page(*page);
> > @@ -2327,6 +2332,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> >  				mmap_read_unlock(mm);
> >  				return rmap_item;
> >  			}
> > +next_page:
> >  			put_page(*page);
> >  			ksm_scan.address += PAGE_SIZE;
> >  			cond_resched();
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 581dfaad9257..fee12cd2f294 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1672,9 +1672,12 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
> >  		goto out;
> >
> >  	err = -ENOENT;
> > -	if (!page || is_zone_device_page(page))
> > +	if (!page)
> >  		goto out;
> >
> > +	if (is_zone_device_page(page))
> > +		goto out_putpage;
> > +
> >  	err = 0;
> >  	if (page_to_nid(page) == node)
> >  		goto out_putpage;
> > @@ -1868,8 +1871,9 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
> >  		if (IS_ERR(page))
> >  			goto set_status;
> >
> > -		if (page && !is_zone_device_page(page)) {
> > -			err = page_to_nid(page);
> > +		if (page) {
> > +			err = !is_zone_device_page(page) ? page_to_nid(page)
> > +							 : -ENOENT;
> 
> Can we remove the multiple layers of conditionals here? Something like
> this is cleaner and easier to understand IMHO:

OK, I will try it in a new patch.

> 
> -               if (page && !is_zone_device_page(page)) {
> -                       err = page_to_nid(page);
> -                       if (foll_flags & FOLL_GET)
> -                               put_page(page);
> -               } else {
> +               if (!page) {
>                         err = -ENOENT;
> +                       goto set_status;
>                 }
> +
> +               if (is_zone_device_page(page))
> +                       err = -ENOENT;
> +               else
> +                       err = page_to_nid(page);
> +
> +               if (foll_flags & FOLL_GET)
> +                       put_page(page);
> 
> Thanks.
> 
>  - Alistair
> 
> >  			if (foll_flags & FOLL_GET)
> >  				put_page(page);
> >  		} else {

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v6 0/2] fix follow_page related issues
  2022-08-12  8:49 [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
                   ` (4 preceding siblings ...)
  2022-08-15  7:02 ` [PATCH v5 0/2] fix follow_page related issues Haiyue Wang
@ 2022-08-16  2:20 ` Haiyue Wang
  2022-08-16  2:21   ` [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
  2022-08-16  2:21   ` [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
  5 siblings, 2 replies; 62+ messages in thread
From: Haiyue Wang @ 2022-08-16  2:20 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, apopple, linmiaohe, ying.huang, songmuchun,
	naoya.horiguchi, alex.sierra, Haiyue Wang

v6: Simplify the multiple layers of if {} conditionals:

-               if (page) {
-                       err = !is_zone_device_page(page) ? page_to_nid(page)
-                                                        : -ENOENT;
-                       if (foll_flags & FOLL_GET)
-                               put_page(page);
-               } else {
-                       err = -ENOENT;
-               }
+               err = -ENOENT;
+               if (!page)
+                       goto set_status;
+
+               if (!is_zone_device_page(page))
+                       err = page_to_nid(page);
+
+               if (foll_flags & FOLL_GET)
+                       put_page(page);

v5: reword the commit message for FOLL_GET with more information.

v4: add '()' to function names for readability.
    add more detail about the Non-LRU pages fix to the commit message.

v3: Merge the fix for handling Non-LRU pages into one patch.
    Drop the break_ksm zone device page check.

v2: Add the Non-LRU pages fix as two patches, so that
    'mm: migration: fix the FOLL_GET' can be applied directly
    on the linux-5.19 stable branch.

Haiyue Wang (2):
  mm: migration: fix the FOLL_GET failure on following huge page
  mm: fix the handling Non-LRU pages returned by follow_page

 mm/huge_memory.c |  4 ++--
 mm/ksm.c         | 12 +++++++++---
 mm/migrate.c     | 23 +++++++++++++++++------
 3 files changed, 28 insertions(+), 11 deletions(-)

-- 
2.37.2


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-16  2:20 ` [PATCH v6 0/2] fix follow_page related issues Haiyue Wang
@ 2022-08-16  2:21   ` Haiyue Wang
  2022-08-16  8:54     ` Baolin Wang
                       ` (2 more replies)
  2022-08-16  2:21   ` [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
  1 sibling, 3 replies; 62+ messages in thread
From: Haiyue Wang @ 2022-08-16  2:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, apopple, linmiaohe, ying.huang, songmuchun,
	naoya.horiguchi, alex.sierra, Haiyue Wang

Not all huge page APIs support the FOLL_GET option, so the move_pages()
syscall will fail to get the page node information for some huge pages.

For example, on x86 with Linux 5.19, the 1GB huge page API
follow_huge_pud() returns a NULL page for FOLL_GET, so calling the
move_pages() syscall with a NULL 'nodes' parameter leaves the '-2'
(-ENOENT) error in the corresponding 'status' array entries.

Note: follow_huge_pud() now supports FOLL_GET in linux 6.0.
      Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev

But these huge page APIs don't support FOLL_GET:
  1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
  2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
     It will cause WARN_ON_ONCE for FOLL_GET.
  3. follow_huge_pgd() in mm/hugetlb.c

This is a temporary solution to mitigate the side effect of the race
condition fix, which calls follow_page() with FOLL_GET set for huge
pages.

Once following huge pages with FOLL_GET is fully supported, this fix
can be reverted safely.
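
A minimal userspace sketch of the failure mode (a hypothetical
reproducer: assumes a pre-allocated 1GB hugetlb pool and libnuma; most
error handling trimmed):

#include <stdio.h>
#include <sys/mman.h>
#include <numaif.h>		/* move_pages(), link with -lnuma */

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << 26)	/* 30 == log2(1GB), MAP_HUGE_SHIFT == 26 */
#endif

int main(void)
{
	size_t sz = 1UL << 30;
	void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		       MAP_HUGE_1GB, -1, 0);
	void *pages[1] = { p };
	int status[1];

	if (p == MAP_FAILED)
		return 1;
	*(volatile char *)p = 0;	/* fault the huge page in */
	/* nodes == NULL asks for the node id of each page */
	if (move_pages(0, 1, pages, NULL, status, 0) == 0)
		printf("status[0] = %d\n", status[0]);	/* -2 on affected kernels */
	return 0;
}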

Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/migrate.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..581dfaad9257 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 	for (i = 0; i < nr_pages; i++) {
 		unsigned long addr = (unsigned long)(*pages);
+		unsigned int foll_flags = FOLL_DUMP;
 		struct vm_area_struct *vma;
 		struct page *page;
 		int err = -EFAULT;
@@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (!vma)
 			goto set_status;
 
+		/* Not all huge page follow APIs support 'FOLL_GET' */
+		if (!is_vm_hugetlb_page(vma))
+			foll_flags |= FOLL_GET;
+
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+		page = follow_page(vma, addr, foll_flags);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 		if (page && !is_zone_device_page(page)) {
 			err = page_to_nid(page);
-			put_page(page);
+			if (foll_flags & FOLL_GET)
+				put_page(page);
 		} else {
 			err = -ENOENT;
 		}
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-16  2:20 ` [PATCH v6 0/2] fix follow_page related issues Haiyue Wang
  2022-08-16  2:21   ` [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
@ 2022-08-16  2:21   ` Haiyue Wang
  2022-08-16  4:42     ` Alistair Popple
  2022-08-17  2:34     ` Miaohe Lin
  1 sibling, 2 replies; 62+ messages in thread
From: Haiyue Wang @ 2022-08-16  2:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, apopple, linmiaohe, ying.huang, songmuchun,
	naoya.horiguchi, alex.sierra, Haiyue Wang, Felix Kuehling

The handling of non-LRU pages returned by follow_page() jumps out
directly without calling put_page() to drop the reference count, even
though follow_page() with the 'FOLL_GET' flag has already called
get_page(). Fix the zone device page check by handling the page
reference count correctly before returning.

As David noted in review, "device pages are never PageKsm pages", so
drop the zone device page check from break_ksm().
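
A condensed view of the leak pattern being fixed (an illustrative
caller, not code copied from the tree):

	page = follow_page(vma, addr, FOLL_GET);
	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
		goto out;	/* leak: for a zone device page, the
				 * reference taken under FOLL_GET is
				 * never dropped */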

Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 mm/huge_memory.c |  4 ++--
 mm/ksm.c         | 12 +++++++++---
 mm/migrate.c     | 19 ++++++++++++-------
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a7c1b344abe..b2ba17c3dcd7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		/* FOLL_DUMP to ignore special (like zero) pages */
 		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
 
-		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+		if (IS_ERR_OR_NULL(page))
 			continue;
 
-		if (!is_transparent_hugepage(page))
+		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
 			goto next;
 
 		total++;
diff --git a/mm/ksm.c b/mm/ksm.c
index 42ab153335a2..e26f57fc1f0e 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -475,7 +475,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 		cond_resched();
 		page = follow_page(vma, addr,
 				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
-		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
 			ret = handle_mm_fault(vma, addr,
@@ -560,12 +560,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 		goto out;
 
 	page = follow_page(vma, addr, FOLL_GET);
-	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
+	if (IS_ERR_OR_NULL(page))
 		goto out;
+	if (is_zone_device_page(page))
+		goto out_putpage;
 	if (PageAnon(page)) {
 		flush_anon_page(vma, page, addr);
 		flush_dcache_page(page);
 	} else {
+out_putpage:
 		put_page(page);
 out:
 		page = NULL;
@@ -2308,11 +2311,13 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 			if (ksm_test_exit(mm))
 				break;
 			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
-			if (IS_ERR_OR_NULL(*page) || is_zone_device_page(*page)) {
+			if (IS_ERR_OR_NULL(*page)) {
 				ksm_scan.address += PAGE_SIZE;
 				cond_resched();
 				continue;
 			}
+			if (is_zone_device_page(*page))
+				goto next_page;
 			if (PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
 				flush_dcache_page(*page);
@@ -2327,6 +2332,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 				mmap_read_unlock(mm);
 				return rmap_item;
 			}
+next_page:
 			put_page(*page);
 			ksm_scan.address += PAGE_SIZE;
 			cond_resched();
diff --git a/mm/migrate.c b/mm/migrate.c
index 581dfaad9257..44e05ce41d49 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1672,9 +1672,12 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 		goto out;
 
 	err = -ENOENT;
-	if (!page || is_zone_device_page(page))
+	if (!page)
 		goto out;
 
+	if (is_zone_device_page(page))
+		goto out_putpage;
+
 	err = 0;
 	if (page_to_nid(page) == node)
 		goto out_putpage;
@@ -1868,13 +1871,15 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (IS_ERR(page))
 			goto set_status;
 
-		if (page && !is_zone_device_page(page)) {
+		err = -ENOENT;
+		if (!page)
+			goto set_status;
+
+		if (!is_zone_device_page(page))
 			err = page_to_nid(page);
-			if (foll_flags & FOLL_GET)
-				put_page(page);
-		} else {
-			err = -ENOENT;
-		}
+
+		if (foll_flags & FOLL_GET)
+			put_page(page);
 set_status:
 		*status = err;
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-16  1:12       ` Wang, Haiyue
@ 2022-08-16  2:45         ` Alistair Popple
  0 siblings, 0 replies; 62+ messages in thread
From: Alistair Popple @ 2022-08-16  2:45 UTC (permalink / raw)
  To: Wang, Haiyue
  Cc: linux-mm, linux-kernel, akpm, david, linmiaohe, Huang, Ying,
	songmuchun, naoya.horiguchi, alex.sierra, Felix Kuehling


"Wang, Haiyue" <haiyue.wang@intel.com> writes:

>> -----Original Message-----
>> From: Alistair Popple <apopple@nvidia.com>
>> Sent: Tuesday, August 16, 2022 08:01
>> To: Wang, Haiyue <haiyue.wang@intel.com>
>> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; david@redhat.com;
>> linmiaohe@huawei.com; Huang, Ying <ying.huang@intel.com>; songmuchun@bytedance.com;
>> naoya.horiguchi@linux.dev; alex.sierra@amd.com; Felix Kuehling <Felix.Kuehling@amd.com>
>> Subject: Re: [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page
>>
>>
>> Haiyue Wang <haiyue.wang@intel.com> writes:
>>
>> > The handling of non-LRU pages returned by follow_page() jumps out
>> > directly without calling put_page() to drop the reference count, even
>> > though follow_page() with the 'FOLL_GET' flag has already called
>> > get_page(). Fix the zone device page check by handling the page
>> > reference count correctly before returning.
>> >
>> > As David noted in review, "device pages are never PageKsm pages", so
>> > drop the zone device page check from break_ksm().
>> >
>> > Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
>> > Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
>> > ---
>> >  mm/huge_memory.c |  4 ++--
>> >  mm/ksm.c         | 12 +++++++++---
>> >  mm/migrate.c     | 10 +++++++---
>> >  3 files changed, 18 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> > index 8a7c1b344abe..b2ba17c3dcd7 100644
>> > --- a/mm/huge_memory.c
>> > +++ b/mm/huge_memory.c
>> > @@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>> >  		/* FOLL_DUMP to ignore special (like zero) pages */
>> >  		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
>> >
>> > -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
>> > +		if (IS_ERR_OR_NULL(page))
>> >  			continue;
>> >
>> > -		if (!is_transparent_hugepage(page))
>> > +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
>> >  			goto next;
>> >
>> >  		total++;
>> > diff --git a/mm/ksm.c b/mm/ksm.c
>> > index 42ab153335a2..e26f57fc1f0e 100644
>> > --- a/mm/ksm.c
>> > +++ b/mm/ksm.c
>> > @@ -475,7 +475,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
>> >  		cond_resched();
>> >  		page = follow_page(vma, addr,
>> >  				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
>> > -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
>> > +		if (IS_ERR_OR_NULL(page))
>> >  			break;
>> >  		if (PageKsm(page))
>> >  			ret = handle_mm_fault(vma, addr,
>> > @@ -560,12 +560,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
>> >  		goto out;
>> >
>> >  	page = follow_page(vma, addr, FOLL_GET);
>> > -	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
>> > +	if (IS_ERR_OR_NULL(page))
>> >  		goto out;
>> > +	if (is_zone_device_page(page))
>>
>> Same as for break_ksm(), I think we should be able to drop the
>> is_zone_device_page() check here because scan_get_next_rmap_item()
>> already filters out zone device pages.
>>
>
> The 'page' for scan_get_next_rmap_item() comes from a 'vma' that the loop
> filters inline, skipping any vma which is NOT MERGEABLE:
> 	for (; vma; vma = vma->vm_next) {
> 		if (!(vma->vm_flags & VM_MERGEABLE))
> 			continue;
>
> The 'page' for get_mergeable_page() comes from a 'vma' validated as
> MERGEABLE by 'find_mergeable_vma()'.

Oh, ok. I'm actually not too familiar with KSM, but I think I follow, so
if you think we need to keep the check, by all means do so.

> So they may be different, and unstable_tree_search_insert() shows the logic:
>
>  'page' vs 'tree_page':
>
> 		tree_page = get_mergeable_page(tree_rmap_item);
> 		if (!tree_page)
> 			return NULL;
>
> 		/*
> 		 * Don't substitute a ksm page for a forked page.
> 		 */
> 		if (page == tree_page) {
> 			put_page(tree_page);
> 			return NULL;
> 		}
>
> 		ret = memcmp_pages(page, tree_page);
>
>
>> > +		goto out_putpage;
>> >  	if (PageAnon(page)) {
>> >  		flush_anon_page(vma, page, addr);
>> >  		flush_dcache_page(page);
>> >  	} else {
>> > +out_putpage:
>> >  		put_page(page);
>> >  out:
>> >  		page = NULL;
>> > @@ -2308,11 +2311,13 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>> >  			if (ksm_test_exit(mm))
>> >  				break;
>> >  			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
>> > -			if (IS_ERR_OR_NULL(*page) || is_zone_device_page(*page)) {
>> > +			if (IS_ERR_OR_NULL(*page)) {
>> >  				ksm_scan.address += PAGE_SIZE;
>> >  				cond_resched();
>> >  				continue;
>> >  			}
>> > +			if (is_zone_device_page(*page))
>> > +				goto next_page;
>> >  			if (PageAnon(*page)) {
>> >  				flush_anon_page(vma, *page, ksm_scan.address);
>> >  				flush_dcache_page(*page);
>> > @@ -2327,6 +2332,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>> >  				mmap_read_unlock(mm);
>> >  				return rmap_item;
>> >  			}
>> > +next_page:
>> >  			put_page(*page);
>> >  			ksm_scan.address += PAGE_SIZE;
>> >  			cond_resched();
>> > diff --git a/mm/migrate.c b/mm/migrate.c
>> > index 581dfaad9257..fee12cd2f294 100644
>> > --- a/mm/migrate.c
>> > +++ b/mm/migrate.c
>> > @@ -1672,9 +1672,12 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>> >  		goto out;
>> >
>> >  	err = -ENOENT;
>> > -	if (!page || is_zone_device_page(page))
>> > +	if (!page)
>> >  		goto out;
>> >
>> > +	if (is_zone_device_page(page))
>> > +		goto out_putpage;
>> > +
>> >  	err = 0;
>> >  	if (page_to_nid(page) == node)
>> >  		goto out_putpage;
>> > @@ -1868,8 +1871,9 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>> >  		if (IS_ERR(page))
>> >  			goto set_status;
>> >
>> > -		if (page && !is_zone_device_page(page)) {
>> > -			err = page_to_nid(page);
>> > +		if (page) {
>> > +			err = !is_zone_device_page(page) ? page_to_nid(page)
>> > +							 : -ENOENT;
>>
>> Can we remove the multiple layers of conditionals here? Something like
>> this is cleaner and easier to understand IMHO:
>
> OK, I will try it in a new patch.

Thanks.

>>
>> -               if (page && !is_zone_device_page(page)) {
>> -                       err = page_to_nid(page);
>> -                       if (foll_flags & FOLL_GET)
>> -                               put_page(page);
>> -               } else {
>> +               if (!page) {
>>                         err = -ENOENT;
>> +                       goto set_status;
>>                 }
>> +
>> +               if (is_zone_device_page(page))
>> +                       err = -ENOENT;
>> +               else
>> +                       err = page_to_nid(page);
>> +
>> +               if (foll_flags & FOLL_GET)
>> +                       put_page(page);
>>
>> Thanks.
>>
>>  - Alistair
>>
>> >  			if (foll_flags & FOLL_GET)
>> >  				put_page(page);
>> >  		} else {

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-16  2:21   ` [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
@ 2022-08-16  4:42     ` Alistair Popple
  2022-08-17  2:34     ` Miaohe Lin
  1 sibling, 0 replies; 62+ messages in thread
From: Alistair Popple @ 2022-08-16  4:42 UTC (permalink / raw)
  To: Haiyue Wang
  Cc: linux-mm, linux-kernel, akpm, david, linmiaohe, ying.huang,
	songmuchun, naoya.horiguchi, alex.sierra, Felix Kuehling


Looks good, thanks.

Reviewed-by: Alistair Popple <apopple@nvidia.com>

Haiyue Wang <haiyue.wang@intel.com> writes:

> The handling of non-LRU pages returned by follow_page() jumps out
> directly without calling put_page() to drop the reference count, even
> though follow_page() with the 'FOLL_GET' flag has already called
> get_page(). Fix the zone device page check by handling the page
> reference count correctly before returning.
>
> As David noted in review, "device pages are never PageKsm pages", so
> drop the zone device page check from break_ksm().
>
> Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
> ---
>  mm/huge_memory.c |  4 ++--
>  mm/ksm.c         | 12 +++++++++---
>  mm/migrate.c     | 19 ++++++++++++-------
>  3 files changed, 23 insertions(+), 12 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8a7c1b344abe..b2ba17c3dcd7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		/* FOLL_DUMP to ignore special (like zero) pages */
>  		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
>
> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +		if (IS_ERR_OR_NULL(page))
>  			continue;
>
> -		if (!is_transparent_hugepage(page))
> +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
>  			goto next;
>
>  		total++;
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 42ab153335a2..e26f57fc1f0e 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -475,7 +475,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
>  		cond_resched();
>  		page = follow_page(vma, addr,
>  				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +		if (IS_ERR_OR_NULL(page))
>  			break;
>  		if (PageKsm(page))
>  			ret = handle_mm_fault(vma, addr,
> @@ -560,12 +560,15 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
>  		goto out;
>
>  	page = follow_page(vma, addr, FOLL_GET);
> -	if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +	if (IS_ERR_OR_NULL(page))
>  		goto out;
> +	if (is_zone_device_page(page))
> +		goto out_putpage;
>  	if (PageAnon(page)) {
>  		flush_anon_page(vma, page, addr);
>  		flush_dcache_page(page);
>  	} else {
> +out_putpage:
>  		put_page(page);
>  out:
>  		page = NULL;
> @@ -2308,11 +2311,13 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  			if (ksm_test_exit(mm))
>  				break;
>  			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
> -			if (IS_ERR_OR_NULL(*page) || is_zone_device_page(*page)) {
> +			if (IS_ERR_OR_NULL(*page)) {
>  				ksm_scan.address += PAGE_SIZE;
>  				cond_resched();
>  				continue;
>  			}
> +			if (is_zone_device_page(*page))
> +				goto next_page;
>  			if (PageAnon(*page)) {
>  				flush_anon_page(vma, *page, ksm_scan.address);
>  				flush_dcache_page(*page);
> @@ -2327,6 +2332,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  				mmap_read_unlock(mm);
>  				return rmap_item;
>  			}
> +next_page:
>  			put_page(*page);
>  			ksm_scan.address += PAGE_SIZE;
>  			cond_resched();
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 581dfaad9257..44e05ce41d49 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1672,9 +1672,12 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>  		goto out;
>
>  	err = -ENOENT;
> -	if (!page || is_zone_device_page(page))
> +	if (!page)
>  		goto out;
>
> +	if (is_zone_device_page(page))
> +		goto out_putpage;
> +
>  	err = 0;
>  	if (page_to_nid(page) == node)
>  		goto out_putpage;
> @@ -1868,13 +1871,15 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  		if (IS_ERR(page))
>  			goto set_status;
>
> -		if (page && !is_zone_device_page(page)) {
> +		err = -ENOENT;
> +		if (!page)
> +			goto set_status;
> +
> +		if (!is_zone_device_page(page))
>  			err = page_to_nid(page);
> -			if (foll_flags & FOLL_GET)
> -				put_page(page);
> -		} else {
> -			err = -ENOENT;
> -		}
> +
> +		if (foll_flags & FOLL_GET)
> +			put_page(page);
>  set_status:
>  		*status = err;

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-16  2:21   ` [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
@ 2022-08-16  8:54     ` Baolin Wang
  2022-08-17  0:58     ` Andrew Morton
  2022-08-17  2:12     ` Miaohe Lin
  2 siblings, 0 replies; 62+ messages in thread
From: Baolin Wang @ 2022-08-16  8:54 UTC (permalink / raw)
  To: Haiyue Wang, linux-mm, linux-kernel
  Cc: akpm, david, apopple, linmiaohe, ying.huang, songmuchun,
	naoya.horiguchi, alex.sierra



On 8/16/2022 10:21 AM, Haiyue Wang wrote:
> Not all huge page APIs support the FOLL_GET option, so the move_pages()
> syscall will fail to get the page node information for some huge pages.
> 
> For example, on x86 with Linux 5.19, the 1GB huge page API
> follow_huge_pud() returns a NULL page for FOLL_GET, so calling the
> move_pages() syscall with a NULL 'nodes' parameter leaves the '-2'
> (-ENOENT) error in the corresponding 'status' array entries.
> 
> Note: follow_huge_pud() now supports FOLL_GET in linux 6.0.
>        Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev
> 
> But these huge page APIs don't support FOLL_GET:
>    1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
>    2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
>       It will cause WARN_ON_ONCE for FOLL_GET.
>    3. follow_huge_pgd() in mm/hugetlb.c
> 
> This is a temporary solution to mitigate the side effect of the race
> condition fix, which calls follow_page() with FOLL_GET set for huge
> pages.
> 
> Once following huge pages with FOLL_GET is fully supported, this fix
> can be reverted safely.
> 
> Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> ---

Looks good to me.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

>   mm/migrate.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6a1597c92261..581dfaad9257 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>   
>   	for (i = 0; i < nr_pages; i++) {
>   		unsigned long addr = (unsigned long)(*pages);
> +		unsigned int foll_flags = FOLL_DUMP;
>   		struct vm_area_struct *vma;
>   		struct page *page;
>   		int err = -EFAULT;
> @@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>   		if (!vma)
>   			goto set_status;
>   
> +		/* Not all huge page follow APIs support 'FOLL_GET' */
> +		if (!is_vm_hugetlb_page(vma))
> +			foll_flags |= FOLL_GET;
> +
>   		/* FOLL_DUMP to ignore special (like zero) pages */
> -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> +		page = follow_page(vma, addr, foll_flags);
>   
>   		err = PTR_ERR(page);
>   		if (IS_ERR(page))
> @@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>   
>   		if (page && !is_zone_device_page(page)) {
>   			err = page_to_nid(page);
> -			put_page(page);
> +			if (foll_flags & FOLL_GET)
> +				put_page(page);
>   		} else {
>   			err = -ENOENT;
>   		}

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-16  2:21   ` [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
  2022-08-16  8:54     ` Baolin Wang
@ 2022-08-17  0:58     ` Andrew Morton
  2022-08-17  3:31       ` Wang, Haiyue
  2022-08-17  2:12     ` Miaohe Lin
  2 siblings, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2022-08-17  0:58 UTC (permalink / raw)
  To: Haiyue Wang
  Cc: linux-mm, linux-kernel, david, apopple, linmiaohe, ying.huang,
	songmuchun, naoya.horiguchi, alex.sierra, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Mike Kravetz

On Tue, 16 Aug 2022 10:21:00 +0800 Haiyue Wang <haiyue.wang@intel.com> wrote:

> Not all huge page APIs support the FOLL_GET option, so the move_pages()
> syscall will fail to get the page node information for some huge pages.
> 
> For example, on x86 with Linux 5.19, the 1GB huge page API
> follow_huge_pud() returns a NULL page for FOLL_GET, so calling the
> move_pages() syscall with a NULL 'nodes' parameter leaves the '-2'
> (-ENOENT) error in the corresponding 'status' array entries.
> 
> Note: follow_huge_pud() now supports FOLL_GET in linux 6.0.
>       Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev
> 
> But these huge page APIs don't support FOLL_GET:
>   1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c

Let's tell the s390 maintainers.

>   2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
>      It will cause WARN_ON_ONCE for FOLL_GET.

ia64 doesn't have maintainers :( Can we come up with something local to
arch/ia64 for this?

>   3. follow_huge_pgd() in mm/hugetlb.c

Hi, Mike.

> This is a temporary solution to mitigate the side effect of the race
> condition fix, which calls follow_page() with FOLL_GET set for huge
> pages.
> 
> Once following huge pages with FOLL_GET is fully supported, this fix
> can be reverted safely.
> 
> ...
>
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  
>  	for (i = 0; i < nr_pages; i++) {
>  		unsigned long addr = (unsigned long)(*pages);
> +		unsigned int foll_flags = FOLL_DUMP;
>  		struct vm_area_struct *vma;
>  		struct page *page;
>  		int err = -EFAULT;
> @@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  		if (!vma)
>  			goto set_status;
>  
> +		/* Not all huge page follow APIs support 'FOLL_GET' */
> +		if (!is_vm_hugetlb_page(vma))
> +			foll_flags |= FOLL_GET;
> +
>  		/* FOLL_DUMP to ignore special (like zero) pages */
> -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> +		page = follow_page(vma, addr, foll_flags);
>  
>  		err = PTR_ERR(page);
>  		if (IS_ERR(page))
> @@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  
>  		if (page && !is_zone_device_page(page)) {
>  			err = page_to_nid(page);
> -			put_page(page);
> +			if (foll_flags & FOLL_GET)
> +				put_page(page);
>  		} else {
>  			err = -ENOENT;
>  		}

It would be better to fix this for real at those three client code sites?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-16  2:21   ` [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
  2022-08-16  8:54     ` Baolin Wang
  2022-08-17  0:58     ` Andrew Morton
@ 2022-08-17  2:12     ` Miaohe Lin
  2 siblings, 0 replies; 62+ messages in thread
From: Miaohe Lin @ 2022-08-17  2:12 UTC (permalink / raw)
  To: Haiyue Wang, linux-mm, linux-kernel
  Cc: akpm, david, apopple, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra

On 2022/8/16 10:21, Haiyue Wang wrote:
> Not all huge page APIs support the FOLL_GET option, so the move_pages()
> syscall will fail to get the page node information for some huge pages.
> 
> For example, on x86 with Linux 5.19, the 1GB huge page API
> follow_huge_pud() returns a NULL page for FOLL_GET, so calling the
> move_pages() syscall with a NULL 'nodes' parameter leaves the '-2'
> (-ENOENT) error in the corresponding 'status' array entries.
> 
> Note: follow_huge_pud() now supports FOLL_GET in linux 6.0.
>       Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev
> 
> But these huge page APIs don't support FOLL_GET:
>   1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
>   2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
>      It will cause WARN_ON_ONCE for FOLL_GET.
>   3. follow_huge_pgd() in mm/hugetlb.c
> 
> This is a temporary solution to mitigate the side effect of the race
> condition fix, which calls follow_page() with FOLL_GET set for huge
> pages.
> 
> Once following huge pages with FOLL_GET is fully supported, this fix
> can be reverted safely.
> 
> Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

LGTM. Many thanks for fixing this.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks,
Miaohe Lin



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-16  2:21   ` [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
  2022-08-16  4:42     ` Alistair Popple
@ 2022-08-17  2:34     ` Miaohe Lin
  2022-08-23 10:07       ` David Hildenbrand
  1 sibling, 1 reply; 62+ messages in thread
From: Miaohe Lin @ 2022-08-17  2:34 UTC (permalink / raw)
  To: Haiyue Wang, linux-mm, linux-kernel
  Cc: akpm, david, apopple, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Felix Kuehling

On 2022/8/16 10:21, Haiyue Wang wrote:
> The handling of non-LRU pages returned by follow_page() jumps out
> directly without calling put_page() to drop the reference count, even
> though follow_page() with the 'FOLL_GET' flag has already called
> get_page(). Fix the zone device page check by handling the page
> reference count correctly before returning.
> 
> As David noted in review, "device pages are never PageKsm pages", so
> drop the zone device page check from break_ksm().
> 
> Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

Thanks for the fix. LGTM with one nit below, but I have no strong opinion on it.
So with or without fixing the nit below:

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> ---
>  mm/huge_memory.c |  4 ++--
>  mm/ksm.c         | 12 +++++++++---
>  mm/migrate.c     | 19 ++++++++++++-------
>  3 files changed, 23 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8a7c1b344abe..b2ba17c3dcd7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		/* FOLL_DUMP to ignore special (like zero) pages */
>  		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
>  
> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> +		if (IS_ERR_OR_NULL(page))
>  			continue;
>  
> -		if (!is_transparent_hugepage(page))
> +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))

!is_transparent_hugepage() should already do the work here? IIRC, a zone device page can't be
a transhuge page anyway, and only transparent hugepages are of interest here.

Thanks,
Miaohe Lin


^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-17  0:58     ` Andrew Morton
@ 2022-08-17  3:31       ` Wang, Haiyue
  2022-08-17  5:43         ` Andrew Morton
  0 siblings, 1 reply; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-17  3:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, david, apopple, linmiaohe, Huang, Ying,
	songmuchun, naoya.horiguchi, alex.sierra, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Mike Kravetz

> -----Original Message-----
> From: Andrew Morton <akpm@linux-foundation.org>
> Sent: Wednesday, August 17, 2022 08:59
> To: Wang, Haiyue <haiyue.wang@intel.com>
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; david@redhat.com; apopple@nvidia.com;
> linmiaohe@huawei.com; Huang, Ying <ying.huang@intel.com>; songmuchun@bytedance.com;
> naoya.horiguchi@linux.dev; alex.sierra@amd.com; Heiko Carstens <hca@linux.ibm.com>; Vasily Gorbik
> <gor@linux.ibm.com>; Alexander Gordeev <agordeev@linux.ibm.com>; Christian Borntraeger
> <borntraeger@linux.ibm.com>; Sven Schnelle <svens@linux.ibm.com>; Mike Kravetz
> <mike.kravetz@oracle.com>
> Subject: Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
> 
> On Tue, 16 Aug 2022 10:21:00 +0800 Haiyue Wang <haiyue.wang@intel.com> wrote:
> 
> > Not all huge page APIs support FOLL_GET option, so move_pages() syscall
> > will fail to get the page node information for some huge pages.
> >
> > Like x86 on linux 5.19 with 1GB huge page API follow_huge_pud(), it will
> > return NULL page for FOLL_GET when calling move_pages() syscall with the
> > NULL 'nodes' parameter, the 'status' parameter has '-2' error in array.
> >
> > Note: follow_huge_pud() now supports FOLL_GET in linux 6.0.
> >       Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev
> >
> > But these huge page APIs don't support FOLL_GET:
> >   1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
> 
> Let's tell the s390 maintainers.
> 
> >   2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
> >      It will cause WARN_ON_ONCE for FOLL_GET.
> 
> ia64 doesn't have maintainers :( Can we come up with something local to
> arch/ia64 for this?

The 'follow_huge_addr' prototype itself only cares about "FOLL_WRITE":
struct page *
follow_huge_addr(struct mm_struct *mm, unsigned long address,
			      int write)

And arch/ia64 defined this function 17 years ago ...

But I found that the check behind "WARN_ON_ONCE for FOLL_GET" was
introduced (as a BUG_ON back then) on 2005-10-29 by commit:

[PATCH] mm: follow_page with inner ptlock

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=deceb6cd17e6dfafe4c4f81b1b4153bc41b2cb70

-	page = follow_huge_addr(mm, address, write);
-	if (! IS_ERR(page))
-		return page;
+	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
+	if (!IS_ERR(page)) {
+		BUG_ON(flags & FOLL_GET);
+		goto out;
+	}
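
For reference, the ia64 implementation looks roughly like this
(condensed from arch/ia64/mm/hugetlbpage.c, so treat it as a sketch):
it honours only the write flag and never takes a page reference, which
is why FOLL_GET ends up tripping the warning in follow_page_mask().

struct page *
follow_huge_addr(struct mm_struct *mm, unsigned long addr, int write)
{
	struct page *page;
	pte_t *ptep;

	if (REGION_NUMBER(addr) != RGN_HPAGE)
		return ERR_PTR(-EINVAL);

	ptep = huge_pte_offset(mm, addr, HPAGE_SIZE);
	if (!ptep || pte_none(*ptep))
		return NULL;
	/* no get_page() here: callers must not pass FOLL_GET */
	page = pte_page(*ptep);
	page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
	return page;
}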

> 
> >   3. follow_huge_pgd() in mm/hugetlb.c
> 
> Hi, Mike.
> 


> >  		}
> 
> It would be better to fix this for real at those three client code sites?

Then 5.19 will break for a while waiting for the final BIG patch?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-17  3:31       ` Wang, Haiyue
@ 2022-08-17  5:43         ` Andrew Morton
  2022-08-17  5:47           ` Wang, Haiyue
                             ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Andrew Morton @ 2022-08-17  5:43 UTC (permalink / raw)
  To: Wang, Haiyue
  Cc: linux-mm, linux-kernel, david, apopple, linmiaohe, Huang, Ying,
	songmuchun, naoya.horiguchi, alex.sierra, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Mike Kravetz

On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:

> > >  		}
> > 
> > It would be better to fix this for real at those three client code sites?
> 
> Then 5.19 will break for a while waiting for the final BIG patch?

If that's the proposal then your [1/2] should have had a cc:stable and
changelog words describing the plan for 6.0.

But before we do that I'd like to see at least a prototype of the final
fixes to s390 and hugetlb, so we can assess those as preferable for
backporting.  I don't think they'll be terribly intrusive or risky?


^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-17  5:43         ` Andrew Morton
@ 2022-08-17  5:47           ` Wang, Haiyue
  2022-08-17 17:26           ` Mike Kravetz
  2022-08-18 11:51           ` Gerald Schaefer
  2 siblings, 0 replies; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-17  5:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, david, apopple, linmiaohe, Huang, Ying,
	songmuchun, naoya.horiguchi, alex.sierra, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Mike Kravetz

> -----Original Message-----
> From: Andrew Morton <akpm@linux-foundation.org>
> Sent: Wednesday, August 17, 2022 13:43
> To: Wang, Haiyue <haiyue.wang@intel.com>
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; david@redhat.com; apopple@nvidia.com;
> linmiaohe@huawei.com; Huang, Ying <ying.huang@intel.com>; songmuchun@bytedance.com;
> naoya.horiguchi@linux.dev; alex.sierra@amd.com; Heiko Carstens <hca@linux.ibm.com>; Vasily Gorbik
> <gor@linux.ibm.com>; Alexander Gordeev <agordeev@linux.ibm.com>; Christian Borntraeger
> <borntraeger@linux.ibm.com>; Sven Schnelle <svens@linux.ibm.com>; Mike Kravetz
> <mike.kravetz@oracle.com>
> Subject: Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
> 
> On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> 
> > > >  		}
> > >
> > > It would be better to fix this for real at those three client code sites?
> >
> > Then 5.19 will break for a while waiting for the final BIG patch?
> 
> If that's the proposal then your [1/2] should have had a cc:stable and
> changelog words describing the plan for 6.0.
> 
> But before we do that I'd like to see at least a prototype of the final
> fixes to s390 and hugetlb, so we can assess those as preferable for

Got it, makes sense. ;-)

> backporting.  I don't think they'll be terribly intrusive or risky?


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-17  5:43         ` Andrew Morton
  2022-08-17  5:47           ` Wang, Haiyue
@ 2022-08-17 17:26           ` Mike Kravetz
  2022-08-17 21:58             ` Mike Kravetz
  2022-08-19 11:22             ` Michael Ellerman
  2022-08-18 11:51           ` Gerald Schaefer
  2 siblings, 2 replies; 62+ messages in thread
From: Mike Kravetz @ 2022-08-17 17:26 UTC (permalink / raw)
  To: Andrew Morton, Michael Ellerman
  Cc: Wang, Haiyue, linux-mm, linux-kernel, david, apopple, linmiaohe,
	Huang, Ying, songmuchun, naoya.horiguchi, alex.sierra,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle

On 08/16/22 22:43, Andrew Morton wrote:
> On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> 
> > > >  		}
> > > 
> > > It would be better to fix this for real at those three client code sites?
> > 
> > Then 5.19 will break for a while waiting for the final BIG patch?
> 
> If that's the proposal then your [1/2] should have had a cc:stable and
> changelog words describing the plan for 6.0.
> 
> But before we do that I'd like to see at least a prototype of the final
> fixes to s390 and hugetlb, so we can assess those as preferable for
> backporting.  I don't think they'll be terribly intrusive or risky?

I will start on adding follow_huge_pgd() support.  Although, I may need
some help with verification from the powerpc folks, as that is the only
architecture which supports hugetlb pages at that level.

mpe, any suggestions?

When Naoya recently modified follow_huge_pud, it ended up looking very
much like follow_huge_pmd (as it should).  I expect follow_huge_pgd
will be similar.  In the end we may want to factor out the common code.
However, for a backport it would be better to just modify follow_huge_pgd
and do the cleanup/condensing of the common code later.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-17 17:26           ` Mike Kravetz
@ 2022-08-17 21:58             ` Mike Kravetz
  2022-08-18  0:32               ` Wang, Haiyue
  2022-08-19 11:22             ` Michael Ellerman
  1 sibling, 1 reply; 62+ messages in thread
From: Mike Kravetz @ 2022-08-17 21:58 UTC (permalink / raw)
  To: Andrew Morton, Michael Ellerman
  Cc: Wang, Haiyue, linux-mm, linux-kernel, david, apopple, linmiaohe,
	Huang, Ying, songmuchun, naoya.horiguchi, alex.sierra,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle

On 08/17/22 10:26, Mike Kravetz wrote:
> On 08/16/22 22:43, Andrew Morton wrote:
> > On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> > 
> > > > >  		}
> > > > 
> > > > It would be better to fix this for real at those three client code sites?
> > > 
> > > Then 5.19 will break for a while waiting for the final BIG patch?
> > 
> > If that's the proposal then your [1/2] should have had a cc:stable and
> > changelog words describing the plan for 6.0.
> > 
> > But before we do that I'd like to see at least a prototype of the final
> > fixes to s390 and hugetlb, so we can assess those as preferable for
> > backporting.  I don't think they'll be terribly intrusive or risky?
> 
> I will start on adding follow_huge_pgd() support.  Although, I may need
> some help with verification from the powerpc folks, as that is the only
> architecture which supports hugetlb pages at that level.
> 
> mpe any suggestions?

From 4925a98a6857dbb5a23bd97063ded2648863e65e Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Wed, 17 Aug 2022 14:32:10 -0700
Subject: [PATCH] hugetlb: make follow_huge_pgd support FOLL_GET

The existing version of follow_huge_pgd was very primitive and only
provided limited functionality.  Specifically, it did not support
FOLL_GET.  Update follow_huge_pgd with modifications similar to those
made for follow_huge_pud in commit 3a194f3f8ad0 ("mm/hugetlb: make
pud_huge() and follow_huge_pud() aware of non-present pud entry").

Note, common code should be factored out of follow_huge_p*d routines.
This will be done in future modifications.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ea1c7bfa1cc3..6f32d2bd1ca9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7055,10 +7055,38 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
 struct page * __weak
 follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
 {
-	if (flags & (FOLL_GET | FOLL_PIN))
+	struct page *page = NULL;
+	spinlock_t *ptl;
+	pte_t pte;
+
+	if (WARN_ON_ONCE(flags & FOLL_PIN))
 		return NULL;
 
-	return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
+retry:
+	ptl = huge_pte_lock(hstate_sizelog(PGDIR_SHIFT), mm, (pte_t *)pgd);
+	if (!pgd_huge(*pgd))
+		goto out;
+	pte = huge_ptep_get((pte_t *)pgd);
+	if (pte_present(pte)) {
+		page = pgd_page(*pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
+		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+			page = NULL;
+			goto out;
+		}
+	} else {
+		if (is_hugetlb_entry_migration(pte)) {
+			spin_unlock(ptl);
+			__migration_entry_wait(mm, (pte_t *)pgd, ptl);
+			goto retry;
+		}
+		/*
+		 * hwpoisoned entry is treated as no_page_table in
+		 * follow_page_mask().
+		 */
+	}
+out:
+	spin_unlock(ptl);
+	return page;
 }
 
 int isolate_hugetlb(struct page *page, struct list_head *list)
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* RE: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-17 21:58             ` Mike Kravetz
@ 2022-08-18  0:32               ` Wang, Haiyue
  0 siblings, 0 replies; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-18  0:32 UTC (permalink / raw)
  To: Mike Kravetz, Andrew Morton, Michael Ellerman
  Cc: linux-mm, linux-kernel, david, apopple, linmiaohe, Huang, Ying,
	songmuchun, naoya.horiguchi, alex.sierra, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle

> -----Original Message-----
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Sent: Thursday, August 18, 2022 05:58
> To: Andrew Morton <akpm@linux-foundation.org>; Michael Ellerman <mpe@ellerman.id.au>
> Cc: Wang, Haiyue <haiyue.wang@intel.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org;
> david@redhat.com; apopple@nvidia.com; linmiaohe@huawei.com; Huang, Ying <ying.huang@intel.com>;
> songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com; Heiko Carstens
> <hca@linux.ibm.com>; Vasily Gorbik <gor@linux.ibm.com>; Alexander Gordeev <agordeev@linux.ibm.com>;
> Christian Borntraeger <borntraeger@linux.ibm.com>; Sven Schnelle <svens@linux.ibm.com>
> Subject: Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
> 
> On 08/17/22 10:26, Mike Kravetz wrote:
> > On 08/16/22 22:43, Andrew Morton wrote:
> > > On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> > >
> > > > > >  		}
> > > > >
> > > > > It would be better to fix this for real at those three client code sites?
> > > >
> > > > Then 5.19 will break for a while waiting for the final BIG patch?
> > >
> > > If that's the proposal then your [1/2] should have had a cc:stable and
> > > changelog words describing the plan for 6.0.
> > >
> > > But before we do that I'd like to see at least a prototype of the final
> > > fixes to s390 and hugetlb, so we can assess those as preferable for
> > > backporting.  I don't think they'll be terribly intrusive or risky?
> >
> > I will start on adding follow_huge_pgd() support.  Although, I may need
> > some help with verification from the powerpc folks, as that is the only
> > architecture which supports hugetlb pages at that level.
> >
> > mpe any suggestions?
> 
> From 4925a98a6857dbb5a23bd97063ded2648863e65e Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Wed, 17 Aug 2022 14:32:10 -0700
> Subject: [PATCH] hugetlb: make follow_huge_pgd support FOLL_GET
> 
> The existing version of follow_huge_pgd was very primitive and only
> provided limited functionality.  Specifically, it did not support
> FOLL_GET.  Update follow_huge_pgd with modifications similar to those
> made for follow_huge_pud in commit 3a194f3f8ad0 ("mm/hugetlb: make
> pud_huge() and follow_huge_pud() aware of non-present pud entry").
> 
> Note, common code should be factored out of follow_huge_p*d routines.
> This will be done in future modifications.
> 

I found "Anshuman Khandual <khandual@linux.vnet.ibm.com>" submit the similar
patch on "Apr 2016 11:07:37 +0530"

[PATCH 03/10] mm/hugetlb: Protect follow_huge_(pud|pgd) functions from race
https://lore.kernel.org/all/1460007464-26726-4-git-send-email-khandual@linux.vnet.ibm.com/

> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ea1c7bfa1cc3..6f32d2bd1ca9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7055,10 +7055,38 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
>  struct page * __weak
>  follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
>  {
> -	if (flags & (FOLL_GET | FOLL_PIN))
> +	struct page *page = NULL;
> +	spinlock_t *ptl;
> +	pte_t pte;
> +
> +	if (WARN_ON_ONCE(flags & FOLL_PIN))
>  		return NULL;
> 
> -	return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
> +retry:
> +	ptl = huge_pte_lock(hstate_sizelog(PGDIR_SHIFT), mm, (pte_t *)pgd);
> +	if (!pgd_huge(*pgd))
> +		goto out;
> +	pte = huge_ptep_get((pte_t *)pgd);
> +	if (pte_present(pte)) {
> +		page = pgd_page(*pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
> +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
> +			page = NULL;
> +			goto out;
> +		}
> +	} else {
> +		if (is_hugetlb_entry_migration(pte)) {
> +			spin_unlock(ptl);
> +			__migration_entry_wait(mm, (pte_t *)pgd, ptl);
> +			goto retry;
> +		}
> +		/*
> +		 * hwpoisoned entry is treated as no_page_table in
> +		 * follow_page_mask().
> +		 */
> +	}
> +out:
> +	spin_unlock(ptl);
> +	return page;
>  }
> 
>  int isolate_hugetlb(struct page *page, struct list_head *list)
> --
> 2.37.1


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-17  5:43         ` Andrew Morton
  2022-08-17  5:47           ` Wang, Haiyue
  2022-08-17 17:26           ` Mike Kravetz
@ 2022-08-18 11:51           ` Gerald Schaefer
  2022-08-18 11:57             ` Gerald Schaefer
  2 siblings, 1 reply; 62+ messages in thread
From: Gerald Schaefer @ 2022-08-18 11:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wang, Haiyue, linux-mm, linux-kernel, david, apopple, linmiaohe,
	Huang, Ying, songmuchun, naoya.horiguchi, alex.sierra,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Mike Kravetz

On Tue, 16 Aug 2022 22:43:22 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> 
> > > >  		}
> > > 
> > > It would be better to fix this for real at those three client code sites?
> > 
> > Then 5.19 will break for a while waiting for the final BIG patch?
> 
> If that's the proposal then your [1/2] should have had a cc:stable and
> changelog words describing the plan for 6.0.
> 
> But before we do that I'd like to see at least a prototype of the final
> fixes to s390 and hugetlb, so we can assess those as preferable for
> backporting.  I don't think they'll be terribly intrusive or risky?
> 

The private follow_huge_pud() for s390 is just some leftover, and the
only reason is / was that the generic version was using pte_page()
instead of pud_page(), which would not work for s390. See also commit
97534127012f ("mm/hugetlb: use pmd_page() in follow_huge_pmd()").

Since commit 3a194f3f8ad01 ("mm/hugetlb: make pud_huge() and
follow_huge_pud() aware of non-present pud entry") made
follow_huge_pud() behave similarly to follow_huge_pmd(), in particular
also adding pud_page(), we can now switch to the generic version.

Note that we cannot support migration / hwpoison for hugetlb or THP,
because of different layout for PTE and PMD/PUD on s390. The generic
swp_entry functions all require proper PTEs, which wouldn't work on
PMD/PUD entries. In theory, at least for hugetlb, due to the "fake
PTE" conversion logic in huge_ptep_get(), we might be able to also
fake swp_entries, but the other problem is that we do not have enough
free bits in the PMD/PUD, so there probably will never be migration
support for huge pages on s390.

Anyway, that should not matter wrt switching to the generic
follow_huge_pud(), because is_hugetlb_entry_migration() should always
return false, and no special change to the pud_huge() check should be
needed like on x86.
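
For reference, the generic version after commit 3a194f3f8ad01 looks
roughly like this (condensed from mm/hugetlb.c; a sketch, not a
verbatim copy), and it is what s390 would fall back to once its private
copy is removed:

struct page * __weak
follow_huge_pud(struct mm_struct *mm, unsigned long address,
		pud_t *pud, int flags)
{
	struct page *page = NULL;
	spinlock_t *ptl;
	pte_t pte;

	if (WARN_ON_ONCE(flags & FOLL_PIN))
		return NULL;

retry:
	ptl = huge_pte_lock(hstate_sizelog(PUD_SHIFT), mm, (pte_t *)pud);
	if (!pud_huge(*pud))
		goto out;
	pte = huge_ptep_get((pte_t *)pud);
	if (pte_present(pte)) {
		/* uses pud_page(), so it is safe for s390 now */
		page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
			page = NULL;
			goto out;
		}
	} else if (is_hugetlb_entry_migration(pte)) {
		/* never true on s390, as noted above */
		spin_unlock(ptl);
		__migration_entry_wait(mm, (pte_t *)pud, ptl);
		goto retry;
	}
out:
	spin_unlock(ptl);
	return page;
}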

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-18 11:51           ` Gerald Schaefer
@ 2022-08-18 11:57             ` Gerald Schaefer
  0 siblings, 0 replies; 62+ messages in thread
From: Gerald Schaefer @ 2022-08-18 11:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wang, Haiyue, linux-mm, linux-kernel, david, apopple, linmiaohe,
	Huang, Ying, songmuchun, naoya.horiguchi, alex.sierra,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Mike Kravetz, linux-s390

On Thu, 18 Aug 2022 13:51:49 +0200
Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:

> On Tue, 16 Aug 2022 22:43:22 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> > 
> > > > >  		}
> > > > 
> > > > It would be better to fix this for real at those three client code sites?
> > > 
> > > Then 5.19 will break for a while waiting for the final BIG patch?
> > 
> > If that's the proposal then your [1/2] should have had a cc:stable and
> > changelog words describing the plan for 6.0.
> > 
> > But before we do that I'd like to see at least a prototype of the final
> > fixes to s390 and hugetlb, so we can assess those as preferable for
> > backporting.  I don't think they'll be terribly intrusive or risky?
> > 
> 
> The private follow_huge_pud() for s390 is just some leftover, and the
> only reason is / was that the generic version was using pte_page()
> instead of pud_page(), which would not work for s390. See also commit
> 97534127012f ("mm/hugetlb: use pmd_page() in follow_huge_pmd()").
> 
> Since commit 3a194f3f8ad01 ("mm/hugetlb: make pud_huge() and
> follow_huge_pud() aware of non-present pud entry") made
> follow_huge_pud() behave similarly to follow_huge_pmd(), in particular
> also adding pud_page(), we can now switch to the generic version.
> 
> Note that we cannot support migration / hwpoison for hugetlb or THP,
> because of the different layout of PTE and PMD/PUD entries on s390. The generic
> swp_entry functions all require proper PTEs, which wouldn't work on
> PMD/PUD entries. In theory, at least for hugetlb, due to the "fake
> PTE" conversion logic in huge_ptep_get(), we might be able to also
> fake swp_entries, but the other problem is that we do not have enough
> free bits in the PMD/PUD, so there probably will never be migration
> support for huge pages on s390.
> 
> Anyway, that should not matter wrt switching to the generic
> follow_huge_pud(), because is_hugetlb_entry_migration() should always
> return false, and no special change to the pud_huge() check should be
> needed, unlike on x86.

From ce0150cd6f80425c702ccdc4cd8a511c47e99b67 Mon Sep 17 00:00:00 2001
From: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Date: Thu, 18 Aug 2022 13:19:23 +0200
Subject: [PATCH] s390/hugetlb: switch to generic version of follow_huge_pud()

When pud-sized hugepages were introduced for s390, the generic version
of follow_huge_pud() was using pte_page() instead of pud_page(). This
would be wrong for s390, see also commit 97534127012f ("mm/hugetlb: use
pmd_page() in follow_huge_pmd()"). Therefore, and probably because not
all archs were supporting pud_page() at that time, a private version of
follow_huge_pud() was added for s390, correctly using pud_page().

Since commit 3a194f3f8ad01 ("mm/hugetlb: make pud_huge() and
follow_huge_pud() aware of non-present pud entry"), the generic version
of follow_huge_pud() is now also using pud_page(), and in general
behaves similarly to follow_huge_pmd().

Therefore we can now switch to the generic version and get rid of the
s390-specific follow_huge_pud().

Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/s390/mm/hugetlbpage.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index 10e51ef9c79a..c299a18273ff 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -237,16 +237,6 @@ int pud_huge(pud_t pud)
 	return pud_large(pud);
 }
 
-struct page *
-follow_huge_pud(struct mm_struct *mm, unsigned long address,
-		pud_t *pud, int flags)
-{
-	if (flags & FOLL_GET)
-		return NULL;
-
-	return pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
-}
-
 bool __init arch_hugetlb_valid_size(unsigned long size)
 {
 	if (MACHINE_HAS_EDAT1 && size == PMD_SIZE)
-- 
2.34.1



* Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-17 17:26           ` Mike Kravetz
  2022-08-17 21:58             ` Mike Kravetz
@ 2022-08-19 11:22             ` Michael Ellerman
  2022-08-19 16:55               ` Mike Kravetz
  1 sibling, 1 reply; 62+ messages in thread
From: Michael Ellerman @ 2022-08-19 11:22 UTC (permalink / raw)
  To: Mike Kravetz, Andrew Morton
  Cc: Wang, Haiyue, linux-mm, linux-kernel, david, apopple, linmiaohe,
	Huang, Ying, songmuchun, naoya.horiguchi, alex.sierra,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, linuxppc-dev

Mike Kravetz <mike.kravetz@oracle.com> writes:
> On 08/16/22 22:43, Andrew Morton wrote:
>> On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
>>
>> > > >  		}
>> > >
>> > > It would be better to fix this for real at those three client code sites?
>> >
>> > Then 5.19 will stay broken for a while waiting for the final BIG patch?
>>
>> If that's the proposal then your [1/2] should have had a cc:stable and
>> changelog words describing the plan for 6.0.
>>
>> But before we do that I'd like to see at least a prototype of the final
>> fixes to s390 and hugetlb, so we can assess those as preferable for
>> backporting.  I don't think they'll be terribly intrusive or risky?
>
> I will start on adding follow_huge_pgd() support.  Although I may need
> some help with verification from the powerpc folks, as that is the only
> architecture which supports hugetlb pages at that level.
>
> mpe any suggestions?

I'm happy to test.

I have a system where I can allocate 1GB huge pages.

I'm not sure how to actually test this path though. I hacked up the
vm/migration.c test to allocate 1GB hugepages, but I can't see it going
through follow_huge_pgd() (using ftrace).

Maybe I hacked it up badly; I'll have a closer look on Monday. But if
you have any tips on how to trigger that path, let me know :)

cheers


* Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-19 11:22             ` Michael Ellerman
@ 2022-08-19 16:55               ` Mike Kravetz
  2022-08-26 13:07                 ` Michael Ellerman
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Kravetz @ 2022-08-19 16:55 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Andrew Morton, Wang, Haiyue, linux-mm, linux-kernel, david,
	apopple, linmiaohe, Huang, Ying, songmuchun, naoya.horiguchi,
	alex.sierra, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, linuxppc-dev

On 08/19/22 21:22, Michael Ellerman wrote:
> Mike Kravetz <mike.kravetz@oracle.com> writes:
> > On 08/16/22 22:43, Andrew Morton wrote:
> >> On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> >>
> >> > > >  		}
> >> > >
> >> > > It would be better to fix this for real at those three client code sites?
> >> >
> >> > Then 5.19 will stay broken for a while waiting for the final BIG patch?
> >>
> >> If that's the proposal then your [1/2] should have had a cc:stable and
> >> changelog words describing the plan for 6.0.
> >>
> >> But before we do that I'd like to see at least a prototype of the final
> >> fixes to s390 and hugetlb, so we can assess those as preferable for
> >> backporting.  I don't think they'll be terribly intrusive or risky?
> >
> > I will start on adding follow_huge_pgd() support.  Although I may need
> > some help with verification from the powerpc folks, as that is the only
> > architecture which supports hugetlb pages at that level.
> >
> > mpe any suggestions?
> 
> I'm happy to test.
> 
> I have a system where I can allocate 1GB huge pages.
> 
> I'm not sure how to actually test this path though. I hacked up the
> vm/migration.c test to allocate 1GB hugepages, but I can't see it going
> through follow_huge_pgd() (using ftrace).

I think you need to use 16GB to trigger this code path.  Anshuman introduced
support for page offline (and migration) at this level in commit 94310cbcaa3c
("mm/madvise: enable (soft|hard) offline of HugeTLB pages at PGD level").
When asked about the use case, he mentioned:

"Yes, it's in the context of 16GB pages on a POWER8 system where all the
 gigantic pages are pre-allocated from the platform and passed on to
 the kernel through the device tree. We don't allocate these gigantic
 pages at runtime."

-- 
Mike Kravetz

> 
> Maybe I hacked it up badly; I'll have a closer look on Monday. But if
> you have any tips on how to trigger that path, let me know :)
> 
> cheers
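
If it helps, a hypothetical userspace sketch along those lines, using
madvise(MADV_SOFT_OFFLINE) on a preallocated 16GB hugetlb page (assumes
a kernel with CONFIG_MEMORY_FAILURE; whether a given kernel actually
reaches follow_huge_pgd() this way is exactly what ftrace would have to
confirm):

	#include <stdio.h>
	#include <sys/mman.h>

	#ifndef MAP_HUGE_SHIFT
	#define MAP_HUGE_SHIFT	26
	#endif
	#ifndef MAP_HUGE_16GB
	#define MAP_HUGE_16GB	(34U << MAP_HUGE_SHIFT)	/* log2(16G) == 34 */
	#endif
	#ifndef MADV_SOFT_OFFLINE
	#define MADV_SOFT_OFFLINE 101	/* from asm-generic/mman-common.h */
	#endif

	int main(void)
	{
		size_t len = 1UL << 34;	/* one 16GB hugetlb page */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
			       MAP_HUGE_16GB, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		*(volatile char *)p = 1;	/* fault the huge page in */

		/* ask the kernel to migrate the page away and offline it */
		if (madvise(p, len, MADV_SOFT_OFFLINE))
			perror("madvise");
		return 0;
	}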


* Re: [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-17  2:34     ` Miaohe Lin
@ 2022-08-23 10:07       ` David Hildenbrand
  2022-08-23 13:26         ` Wang, Haiyue
  0 siblings, 1 reply; 62+ messages in thread
From: David Hildenbrand @ 2022-08-23 10:07 UTC (permalink / raw)
  To: Miaohe Lin, Haiyue Wang, linux-mm, linux-kernel
  Cc: akpm, apopple, ying.huang, songmuchun, naoya.horiguchi,
	alex.sierra, Felix Kuehling

On 17.08.22 04:34, Miaohe Lin wrote:
> On 2022/8/16 10:21, Haiyue Wang wrote:
>> The handling of Non-LRU pages returned by follow_page() jumps out
>> directly without calling put_page() to drop the reference count, even
>> though the 'FOLL_GET' flag for follow_page() means get_page() has been
>> called. Fix the zone device page check by handling the page reference
>> count correctly before returning.
>>
>> And as David reviewed, "device pages are never PageKsm pages". Drop this
>> zone device page check for break_ksm().
>>
>> Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
>> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
>> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
> 
> Thanks for the fix. LGTM with one nit below. But I have no strong opinion on it.
> So with or without fixing the nit below:
> 
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> 
>> ---
>>  mm/huge_memory.c |  4 ++--
>>  mm/ksm.c         | 12 +++++++++---
>>  mm/migrate.c     | 19 ++++++++++++-------
>>  3 files changed, 23 insertions(+), 12 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8a7c1b344abe..b2ba17c3dcd7 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>>  		/* FOLL_DUMP to ignore special (like zero) pages */
>>  		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
>>  
>> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
>> +		if (IS_ERR_OR_NULL(page))
>>  			continue;
>>  
>> -		if (!is_transparent_hugepage(page))
>> +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
> 
> !is_transparent_hugepage should already do the work here? IIRC, zone_device_page can't be
> a transhuge page anyway. And only transparent_hugepage is cared about here.

I agree.

Can we avoid sending a new version of a patch series as a reply to another
patch series (previous version)?


Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb
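
To spell out the reference-counting rule under discussion, a minimal
illustrative fragment (scan_one() is a hypothetical helper, not code
from the patch; it assumes the usual follow_page() semantics, where a
successful lookup with FOLL_GET takes a page reference):

	static void scan_one(struct vm_area_struct *vma, unsigned long addr)
	{
		struct page *page;

		/* FOLL_DUMP to ignore special (like zero) pages */
		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
		if (IS_ERR_OR_NULL(page))
			return;		/* no reference was taken */

		if (!is_transparent_hugepage(page))
			goto out;	/* also filters zone device pages */

		/* ... work on the THP goes here ... */
	out:
		put_page(page);		/* balance the FOLL_GET reference */
	}

With Miaohe's nit applied, the !is_transparent_hugepage() test alone is
enough, since zone device pages are never transhuge.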



* RE: [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-23 10:07       ` David Hildenbrand
@ 2022-08-23 13:26         ` Wang, Haiyue
  2022-08-23 13:27           ` David Hildenbrand
  0 siblings, 1 reply; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-23 13:26 UTC (permalink / raw)
  To: David Hildenbrand, Miaohe Lin, linux-mm, linux-kernel
  Cc: akpm, apopple, Huang, Ying, songmuchun, naoya.horiguchi,
	alex.sierra, Felix Kuehling

> -----Original Message-----
> From: David Hildenbrand <david@redhat.com>
> Sent: Tuesday, August 23, 2022 18:07
> To: Miaohe Lin <linmiaohe@huawei.com>; Wang, Haiyue <haiyue.wang@intel.com>; linux-mm@kvack.org;
> linux-kernel@vger.kernel.org
> Cc: akpm@linux-foundation.org; apopple@nvidia.com; Huang, Ying <ying.huang@intel.com>;
> songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com; Felix Kuehling
> <Felix.Kuehling@amd.com>
> Subject: Re: [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page
> 
> On 17.08.22 04:34, Miaohe Lin wrote:
> > On 2022/8/16 10:21, Haiyue Wang wrote:
> >> The handling of Non-LRU pages returned by follow_page() jumps out
> >> directly without calling put_page() to drop the reference count, even
> >> though the 'FOLL_GET' flag for follow_page() means get_page() has been
> >> called. Fix the zone device page check by handling the page reference
> >> count correctly before returning.
> >>
> >> And as David reviewed, "device pages are never PageKsm pages". Drop this
> >> zone device page check for break_ksm().
> >>
> >> Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> >> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> >> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> >> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
> >
> > Thanks for the fix. LGTM with one nit below. But I have no strong opinion on it.
> > So with or without fixing the nit below:
> >
> > Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> >
> >> ---
> >>  mm/huge_memory.c |  4 ++--
> >>  mm/ksm.c         | 12 +++++++++---
> >>  mm/migrate.c     | 19 ++++++++++++-------
> >>  3 files changed, 23 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 8a7c1b344abe..b2ba17c3dcd7 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
> >>  		/* FOLL_DUMP to ignore special (like zero) pages */
> >>  		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> >>
> >> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> >> +		if (IS_ERR_OR_NULL(page))
> >>  			continue;
> >>
> >> -		if (!is_transparent_hugepage(page))
> >> +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
> >
> > !is_transparent_hugepage should already do the work here? IIRC, zone_device_page can't be
> > a transhuge page anyway. And only transparent_hugepage is cared about here.
> 
> I agree.

OK, will remove it in the next version.

> 
> Can we avoid sending a new version of a patch series as a reply to another
> patch series (previous version)?

Don't use '--in-reply-to=' when running 'git send-email'?

> 
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 
> --
> Thanks,
> 
> David / dhildenb



* Re: [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-23 13:26         ` Wang, Haiyue
@ 2022-08-23 13:27           ` David Hildenbrand
  2022-08-23 13:29             ` Wang, Haiyue
  0 siblings, 1 reply; 62+ messages in thread
From: David Hildenbrand @ 2022-08-23 13:27 UTC (permalink / raw)
  To: Wang, Haiyue, Miaohe Lin, linux-mm, linux-kernel
  Cc: akpm, apopple, Huang, Ying, songmuchun, naoya.horiguchi,
	alex.sierra, Felix Kuehling

On 23.08.22 15:26, Wang, Haiyue wrote:
>> -----Original Message-----
>> From: David Hildenbrand <david@redhat.com>
>> Sent: Tuesday, August 23, 2022 18:07
>> To: Miaohe Lin <linmiaohe@huawei.com>; Wang, Haiyue <haiyue.wang@intel.com>; linux-mm@kvack.org;
>> linux-kernel@vger.kernel.org
>> Cc: akpm@linux-foundation.org; apopple@nvidia.com; Huang, Ying <ying.huang@intel.com>;
>> songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com; Felix Kuehling
>> <Felix.Kuehling@amd.com>
>> Subject: Re: [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page
>>
>> On 17.08.22 04:34, Miaohe Lin wrote:
>>> On 2022/8/16 10:21, Haiyue Wang wrote:
>>>> The handling of Non-LRU pages returned by follow_page() jumps out
>>>> directly without calling put_page() to drop the reference count, even
>>>> though the 'FOLL_GET' flag for follow_page() means get_page() has been
>>>> called. Fix the zone device page check by handling the page reference
>>>> count correctly before returning.
>>>>
>>>> And as David reviewed, "device pages are never PageKsm pages". Drop this
>>>> zone device page check for break_ksm().
>>>>
>>>> Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
>>>> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
>>>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
>>>> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>>
>>> Thanks for the fix. LGTM with one nit below. But I have no strong opinion on it.
>>> So with or without fixing the nit below:
>>>
>>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
>>>
>>>> ---
>>>>  mm/huge_memory.c |  4 ++--
>>>>  mm/ksm.c         | 12 +++++++++---
>>>>  mm/migrate.c     | 19 ++++++++++++-------
>>>>  3 files changed, 23 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 8a7c1b344abe..b2ba17c3dcd7 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>>>>  		/* FOLL_DUMP to ignore special (like zero) pages */
>>>>  		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
>>>>
>>>> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
>>>> +		if (IS_ERR_OR_NULL(page))
>>>>  			continue;
>>>>
>>>> -		if (!is_transparent_hugepage(page))
>>>> +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
>>>
>>> !is_transparent_hugepage should already do the work here? IIRC, zone_device_page can't be
>>> a transhuge page anyway. And only transparent_hugepage is cared about here.
>>
>> I agree.
> 
> OK, will remove it in the next version.
> 
>>
>> Can we avoid sending a new version of a patch series as a reply to another
>> patch series (previous version)?
> 
> Don't use '--in-reply-to=' when running 'git send-email'?

Yes. That makes it easier to identify versions of patch series, without
having to dig down into an ever-growing thread.

-- 
Thanks,

David / dhildenb



* RE: [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page
  2022-08-23 13:27           ` David Hildenbrand
@ 2022-08-23 13:29             ` Wang, Haiyue
  0 siblings, 0 replies; 62+ messages in thread
From: Wang, Haiyue @ 2022-08-23 13:29 UTC (permalink / raw)
  To: David Hildenbrand, Miaohe Lin, linux-mm, linux-kernel
  Cc: akpm, apopple, Huang, Ying, songmuchun, naoya.horiguchi,
	alex.sierra, Felix Kuehling

> -----Original Message-----
> From: David Hildenbrand <david@redhat.com>
> Sent: Tuesday, August 23, 2022 21:28
> To: Wang, Haiyue <haiyue.wang@intel.com>; Miaohe Lin <linmiaohe@huawei.com>; linux-mm@kvack.org;
> linux-kernel@vger.kernel.org
> Cc: akpm@linux-foundation.org; apopple@nvidia.com; Huang, Ying <ying.huang@intel.com>;
> songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com; Felix Kuehling
> <Felix.Kuehling@amd.com>
> Subject: Re: [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page
> 
> On 23.08.22 15:26, Wang, Haiyue wrote:
> >> -----Original Message-----
> >> From: David Hildenbrand <david@redhat.com>
> >> Sent: Tuesday, August 23, 2022 18:07
> >> To: Miaohe Lin <linmiaohe@huawei.com>; Wang, Haiyue <haiyue.wang@intel.com>; linux-mm@kvack.org;
> >> linux-kernel@vger.kernel.org
> >> Cc: akpm@linux-foundation.org; apopple@nvidia.com; Huang, Ying <ying.huang@intel.com>;
> >> songmuchun@bytedance.com; naoya.horiguchi@linux.dev; alex.sierra@amd.com; Felix Kuehling
> >> <Felix.Kuehling@amd.com>
> >> Subject: Re: [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page
> >>
> >> On 17.08.22 04:34, Miaohe Lin wrote:
> >>> On 2022/8/16 10:21, Haiyue Wang wrote:
> >>>> The handling of Non-LRU pages returned by follow_page() jumps out
> >>>> directly without calling put_page() to drop the reference count, even
> >>>> though the 'FOLL_GET' flag for follow_page() means get_page() has been
> >>>> called. Fix the zone device page check by handling the page reference
> >>>> count correctly before returning.
> >>>>
> >>>> And as David reviewed, "device pages are never PageKsm pages". Drop this
> >>>> zone device page check for break_ksm().
> >>>>
> >>>> Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
> >>>> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> >>>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> >>>> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
> >>>
> >>> Thanks for the fix. LGTM with one nit below. But I have no strong opinion on it.
> >>> So with or without fixing the nit below:
> >>>
> >>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> >>>
> >>>> ---
> >>>>  mm/huge_memory.c |  4 ++--
> >>>>  mm/ksm.c         | 12 +++++++++---
> >>>>  mm/migrate.c     | 19 ++++++++++++-------
> >>>>  3 files changed, 23 insertions(+), 12 deletions(-)
> >>>>
> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>> index 8a7c1b344abe..b2ba17c3dcd7 100644
> >>>> --- a/mm/huge_memory.c
> >>>> +++ b/mm/huge_memory.c
> >>>> @@ -2963,10 +2963,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
> >>>>  		/* FOLL_DUMP to ignore special (like zero) pages */
> >>>>  		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> >>>>
> >>>> -		if (IS_ERR_OR_NULL(page) || is_zone_device_page(page))
> >>>> +		if (IS_ERR_OR_NULL(page))
> >>>>  			continue;
> >>>>
> >>>> -		if (!is_transparent_hugepage(page))
> >>>> +		if (is_zone_device_page(page) || !is_transparent_hugepage(page))
> >>>
> >>> !is_transparent_hugepage should already do the work here? IIRC, zone_device_page can't be
> >>> a transhuge page anyway. And only transparent_hugepage is cared about here.
> >>
> >> I agree.
> >
> > OK, will remove it in the next version.
> >
> >>
> >> Can we avoid sending a new version of a patch series as a reply to another
> >> patch series (previous version)?
> >
> > Don't use '--in-reply-to=' when running 'git send-email'?
> 
> Yes. That makes it easier to identify versions of patch series, without
> having to dig down into an ever-growing thread.

Got it, thanks. ;-)

> 
> --
> Thanks,
> 
> David / dhildenb



* Re: [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page
  2022-08-19 16:55               ` Mike Kravetz
@ 2022-08-26 13:07                 ` Michael Ellerman
  0 siblings, 0 replies; 62+ messages in thread
From: Michael Ellerman @ 2022-08-26 13:07 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Andrew Morton, Wang, Haiyue, linux-mm, linux-kernel, david,
	apopple, linmiaohe, Huang, Ying, songmuchun, naoya.horiguchi,
	alex.sierra, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, linuxppc-dev,
	Aneesh Kumar K.V

Mike Kravetz <mike.kravetz@oracle.com> writes:
> On 08/19/22 21:22, Michael Ellerman wrote:
>> Mike Kravetz <mike.kravetz@oracle.com> writes:
>> > On 08/16/22 22:43, Andrew Morton wrote:
>> >> On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
>> >>
>> >> > > >  		}
>> >> > >
>> >> > > It would be better to fix this for real at those three client code sites?
>> >> >
>> >> > Then 5.19 will stay broken for a while waiting for the final BIG patch?
>> >>
>> >> If that's the proposal then your [1/2] should have had a cc:stable and
>> >> changelog words describing the plan for 6.0.
>> >>
>> >> But before we do that I'd like to see at least a prototype of the final
>> >> fixes to s390 and hugetlb, so we can assess those as preferable for
>> >> backporting.  I don't think they'll be terribly intrusive or risky?
>> >
>> > I will start on adding follow_huge_pgd() support.  Although I may need
>> > some help with verification from the powerpc folks, as that is the only
>> > architecture which supports hugetlb pages at that level.
>> >
>> > mpe any suggestions?
>>
>> I'm happy to test.
>>
>> I have a system where I can allocate 1GB huge pages.
>>
>> I'm not sure how to actually test this path though. I hacked up the
>> vm/migration.c test to allocate 1GB hugepages, but I can't see it going
>> through follow_huge_pgd() (using ftrace).
>
> I think you need to use 16GB to trigger this code path.  Anshuman introduced
> support for page offline (and migration) at this level in commit 94310cbcaa3c
> ("mm/madvise: enable (soft|hard) offline of HugeTLB pages at PGD level").
> When asked about the use case, he mentioned:
>
> "Yes, it's in the context of 16GB pages on a POWER8 system where all the
>  gigantic pages are pre-allocated from the platform and passed on to
>  the kernel through the device tree. We don't allocate these gigantic
>  pages at runtime."

That was true, but isn't anymore.

I must have been insufficiently caffeinated the other day. On our newer
machines 1GB is the largest huge page size, but it's obviously way too
small to sit at the PGD level. So that was a waste of my time :)

We used to support 16GB at the PGD level, but we reworked the page table
geometry a few years ago, and now they sit at the PUD level on machines
that support 16GB pages:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ba95b5d0359609b4ec8010f77c40ab3c595a6ac6

Note the author :}

So the good news is we no longer have any configuration where a huge
page entry is expected in the PGD. So we can drop our pgd_huge()
definitions, and ours are the last non-zero definitions, so it can all
go away, I think.

I'll send a patch to remove the powerpc pgd_huge() definitions after
I've run it through some tests.

cheers
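
As a rough sketch of where that cleanup leads (an assumption about the
follow-up patch, not its actual contents): with no PGD-level huge pages
left, the powerpc definitions can go away and the generic fallback
applies:

	/*
	 * Generic fallback in include/linux/hugetlb.h, used once no
	 * arch-specific pgd_huge() definition remains:
	 */
	#ifndef pgd_huge
	#define pgd_huge(x)	0
	#endif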


Thread overview: 62+ messages
2022-08-12  8:49 [PATCH v1] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
2022-08-13 23:28 ` Andrew Morton
2022-08-14  6:20   ` Wang, Haiyue
2022-08-14  6:49     ` Wang, Haiyue
2022-08-14 14:05 ` [PATCH v2 0/3] fix follow_page related issues Haiyue Wang
2022-08-14 14:05   ` [PATCH v2 1/3] mm: revert handling Non-LRU pages returned by follow_page Haiyue Wang
2022-08-14 16:30     ` David Hildenbrand
2022-08-15  1:02       ` Wang, Haiyue
2022-08-14 14:05   ` [PATCH v2 2/3] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
2022-08-14 14:05   ` [PATCH v2 3/3] mm: handling Non-LRU pages returned by follow_page Haiyue Wang
2022-08-14 16:34     ` David Hildenbrand
2022-08-15  1:03       ` Wang, Haiyue
2022-08-15  1:03 ` [PATCH v3 0/2] fix follow_page related issues Haiyue Wang
2022-08-15  1:03   ` [PATCH v3 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
2022-08-15  1:59     ` Huang, Ying
2022-08-15  2:10       ` Wang, Haiyue
2022-08-15  2:15         ` Wang, Haiyue
2022-08-15  2:51           ` Huang, Ying
2022-08-15  1:03   ` [PATCH v3 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
2022-08-15  1:39     ` Huang, Ying
2022-08-15  1:46       ` Wang, Haiyue
2022-08-15  1:59 ` [PATCH v4 0/2] fix follow_page related issues Haiyue Wang
2022-08-15  1:59   ` [PATCH v4 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
2022-08-15  4:28     ` Alistair Popple
2022-08-15  4:40       ` Wang, Haiyue
2022-08-15  5:16         ` Alistair Popple
2022-08-15  5:20           ` Wang, Haiyue
2022-08-15  5:35             ` Alistair Popple
2022-08-15  5:37               ` Wang, Haiyue
2022-08-15  1:59   ` [PATCH v4 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
2022-08-15  7:02 ` [PATCH v5 0/2] fix follow_page related issues Haiyue Wang
2022-08-15  7:02   ` [PATCH v5 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
2022-08-15  7:40     ` Huang, Ying
2022-08-15  7:02   ` [PATCH v5 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
2022-08-15  7:50     ` Huang, Ying
2022-08-15 14:28     ` Felix Kuehling
2022-08-16  0:00     ` Alistair Popple
2022-08-16  1:12       ` Wang, Haiyue
2022-08-16  2:45         ` Alistair Popple
2022-08-16  2:20 ` [PATCH v6 0/2] fix follow_page related issues Haiyue Wang
2022-08-16  2:21   ` [PATCH v6 1/2] mm: migration: fix the FOLL_GET failure on following huge page Haiyue Wang
2022-08-16  8:54     ` Baolin Wang
2022-08-17  0:58     ` Andrew Morton
2022-08-17  3:31       ` Wang, Haiyue
2022-08-17  5:43         ` Andrew Morton
2022-08-17  5:47           ` Wang, Haiyue
2022-08-17 17:26           ` Mike Kravetz
2022-08-17 21:58             ` Mike Kravetz
2022-08-18  0:32               ` Wang, Haiyue
2022-08-19 11:22             ` Michael Ellerman
2022-08-19 16:55               ` Mike Kravetz
2022-08-26 13:07                 ` Michael Ellerman
2022-08-18 11:51           ` Gerald Schaefer
2022-08-18 11:57             ` Gerald Schaefer
2022-08-17  2:12     ` Miaohe Lin
2022-08-16  2:21   ` [PATCH v6 2/2] mm: fix the handling Non-LRU pages returned by follow_page Haiyue Wang
2022-08-16  4:42     ` Alistair Popple
2022-08-17  2:34     ` Miaohe Lin
2022-08-23 10:07       ` David Hildenbrand
2022-08-23 13:26         ` Wang, Haiyue
2022-08-23 13:27           ` David Hildenbrand
2022-08-23 13:29             ` Wang, Haiyue
