All of lore.kernel.org
 help / color / mirror / Atom feed
* [BISECTED] first bad commit is c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
       [not found] <Y9k9Jl9wIaUFZS30@hyeyoo>
@ 2023-01-31 17:35 ` Hyeonggon Yoo
  2023-01-31 22:45   ` Andrew Morton
  2023-02-01 23:28   ` Huang, Ying
  0 siblings, 2 replies; 8+ messages in thread
From: Hyeonggon Yoo @ 2023-01-31 17:35 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Huang Ying, Zi Yan, Yang Shi, Baolin Wang, Oscar Salvador,
	Matthew Wilcox, Bharata B Rao, Alistair Popple, haoxin,
	Minchan Kim

On Wed, Feb 01, 2023 at 01:09:10AM +0900, Hyeonggon Yoo wrote:
> I've observed random list_del corruption on mm-unstable,
> where HEAD is commit d69862e693c069f4
> ("mm/migrate: convert putback_movable_pages() to use folios").
> 
> The issue can be easily reproduced by stressing MM multiple times:
> 	stress-ng --bigheap 0 --timeout 300
> 
> The compiler is gcc 12.2.1 and config, dmesg are included as attachment.
> I will try to bisect but can't promise quick resolution :)


The first bad commits appears to be:
c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")

the first bad commit _probably_ be earlier, but this is quite
easy to reproduce so at this point I think above is the real bad commit.

Here's bisect log:

[hyeyoo@hyeyoo linux]$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [d69862e693c069f4f67fc55c159ce5f6c6def42f] mm/migrate: convert putback_movable_pages() to use folios
git bisect bad d69862e693c069f4f67fc55c159ce5f6c6def42f
# status: waiting for good commit(s), bad commit known
# good: [5dc4c995db9eb45f6373a956eb1f69460e69e6d4] Linux 6.2-rc4
git bisect good 5dc4c995db9eb45f6373a956eb1f69460e69e6d4
# good: [b81cc59c6835bb84e72f1ce516121f25780a42cb] shmem: convert shmem_write_end() to use a folio
git bisect good b81cc59c6835bb84e72f1ce516121f25780a42cb
# bad: [1e2a127908bdbc11065bc08200d0add096d96245] mm: multi-gen LRU: improve walk_pmd_range()
git bisect bad 1e2a127908bdbc11065bc08200d0add096d96245c203c6d5b3f0597
# good: [4d2eba3fd7c0b3ada6b474e4f13dc1238670ea91] ceph: convert ceph_writepages_start() to use filemap_get_folios_tag()
git bisect good 4d2eba3fd7c0b3ada6b474e4f13dc1238670ea91
# good: [b22d808ef6c68b6fa2b5a97f5d02e83ab8fb732d] migrate_pages: separate hugetlb folios migration
git bisect good b22d808ef6c68b6fa2b5a97f5d02e83ab8fb732d
# bad: [6d8b6b69ee6a24a89f2552a12e43b7c9b3a7635d] mm/hugetlb: fix get_hwpoison_hugetlb_folio() stub
git bisect bad 6d8b6b69ee6a24a89f2552a12e43b7c9b3a7635d
# bad: [71d4be767f5fde1ac5df2ade1f654f19c24a0a3e] mm/damon/core: skip apply schemes if empty
git bisect bad 71d4be767f5fde1ac5df2ade1f654f19c24a0a3e
# bad: [68611caff673fc59c17f00dcd0cab2704987828d] migrate_pages: move migrate_folio_unmap()
git bisect bad 68611caff673fc59c17f00dcd0cab2704987828d
# good: [7199465b2be061f1b2bd6b5a1a5a96a822f658b9] migrate_pages: split unmap_and_move() to _unmap() and _move()
git bisect good 7199465b2be061f1b2bd6b5a1a5a96a822f658b9
# bad: [c203c6d5b3f0597a15137b6394fa715542df78c8] migrate_pages: batch _unmap and _move
git bisect bad c203c6d5b3f0597a15137b6394fa715542df78c8
# first bad commit: [c203c6d5b3f0597a15137b6394fa715542df78c8] migrate_pages: batch _unmap and _move


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BISECTED] first bad commit is c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
  2023-01-31 17:35 ` [BISECTED] first bad commit is c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move") Hyeonggon Yoo
@ 2023-01-31 22:45   ` Andrew Morton
  2023-02-01 23:28   ` Huang, Ying
  1 sibling, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2023-01-31 22:45 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, Huang Ying, Zi Yan, Yang Shi, Baolin Wang,
	Oscar Salvador, Matthew Wilcox, Bharata B Rao, Alistair Popple,
	haoxin, Minchan Kim

On Wed, 1 Feb 2023 02:35:55 +0900 Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote:

> On Wed, Feb 01, 2023 at 01:09:10AM +0900, Hyeonggon Yoo wrote:
> > I've observed random list_del corruption on mm-unstable,
> > where HEAD is commit d69862e693c069f4
> > ("mm/migrate: convert putback_movable_pages() to use folios").
> > 
> > The issue can be easily reproduced by stressing MM multiple times:
> > 	stress-ng --bigheap 0 --timeout 300
> > 
> > The compiler is gcc 12.2.1 and config, dmesg are included as attachment.
> > I will try to bisect but can't promise quick resolution :)
> 
> 
> The first bad commits appears to be:
> c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
> 
> the first bad commit _probably_ be earlier, but this is quite
> easy to reproduce so at this point I think above is the real bad commit.

Thanks.  I'll disable this series for now, so it won't appear in linux-next.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BISECTED] first bad commit is c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
  2023-01-31 17:35 ` [BISECTED] first bad commit is c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move") Hyeonggon Yoo
  2023-01-31 22:45   ` Andrew Morton
@ 2023-02-01 23:28   ` Huang, Ying
  2023-02-02 23:17     ` Huang, Ying
  1 sibling, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2023-02-01 23:28 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, akpm, Zi Yan, Yang Shi, Baolin Wang, Oscar Salvador,
	Matthew Wilcox, Bharata B Rao, Alistair Popple, haoxin,
	Minchan Kim

Hi, Hyeonggon,

Hyeonggon Yoo <42.hyeyoo@gmail.com> writes:

> On Wed, Feb 01, 2023 at 01:09:10AM +0900, Hyeonggon Yoo wrote:
>> I've observed random list_del corruption on mm-unstable,
>> where HEAD is commit d69862e693c069f4
>> ("mm/migrate: convert putback_movable_pages() to use folios").
>> 
>> The issue can be easily reproduced by stressing MM multiple times:
>> 	stress-ng --bigheap 0 --timeout 300
>> 
>> The compiler is gcc 12.2.1 and config, dmesg are included as attachment.
>> I will try to bisect but can't promise quick resolution :)
>
>
> The first bad commits appears to be:
> c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
>
> the first bad commit _probably_ be earlier, but this is quite
> easy to reproduce so at this point I think above is the real bad commit.

Thank you very much for reporting the bug.  I'm in travel now but I will
try to find some time to reproduce and debug it.

Best Regards,
Huang, Ying

> Here's bisect log:
>
> [hyeyoo@hyeyoo linux]$ git bisect log
> git bisect start
> # status: waiting for both good and bad commits
> # bad: [d69862e693c069f4f67fc55c159ce5f6c6def42f] mm/migrate: convert putback_movable_pages() to use folios
> git bisect bad d69862e693c069f4f67fc55c159ce5f6c6def42f
> # status: waiting for good commit(s), bad commit known
> # good: [5dc4c995db9eb45f6373a956eb1f69460e69e6d4] Linux 6.2-rc4
> git bisect good 5dc4c995db9eb45f6373a956eb1f69460e69e6d4
> # good: [b81cc59c6835bb84e72f1ce516121f25780a42cb] shmem: convert shmem_write_end() to use a folio
> git bisect good b81cc59c6835bb84e72f1ce516121f25780a42cb
> # bad: [1e2a127908bdbc11065bc08200d0add096d96245] mm: multi-gen LRU: improve walk_pmd_range()
> git bisect bad 1e2a127908bdbc11065bc08200d0add096d96245c203c6d5b3f0597
> # good: [4d2eba3fd7c0b3ada6b474e4f13dc1238670ea91] ceph: convert ceph_writepages_start() to use filemap_get_folios_tag()
> git bisect good 4d2eba3fd7c0b3ada6b474e4f13dc1238670ea91
> # good: [b22d808ef6c68b6fa2b5a97f5d02e83ab8fb732d] migrate_pages: separate hugetlb folios migration
> git bisect good b22d808ef6c68b6fa2b5a97f5d02e83ab8fb732d
> # bad: [6d8b6b69ee6a24a89f2552a12e43b7c9b3a7635d] mm/hugetlb: fix get_hwpoison_hugetlb_folio() stub
> git bisect bad 6d8b6b69ee6a24a89f2552a12e43b7c9b3a7635d
> # bad: [71d4be767f5fde1ac5df2ade1f654f19c24a0a3e] mm/damon/core: skip apply schemes if empty
> git bisect bad 71d4be767f5fde1ac5df2ade1f654f19c24a0a3e
> # bad: [68611caff673fc59c17f00dcd0cab2704987828d] migrate_pages: move migrate_folio_unmap()
> git bisect bad 68611caff673fc59c17f00dcd0cab2704987828d
> # good: [7199465b2be061f1b2bd6b5a1a5a96a822f658b9] migrate_pages: split unmap_and_move() to _unmap() and _move()
> git bisect good 7199465b2be061f1b2bd6b5a1a5a96a822f658b9
> # bad: [c203c6d5b3f0597a15137b6394fa715542df78c8] migrate_pages: batch _unmap and _move
> git bisect bad c203c6d5b3f0597a15137b6394fa715542df78c8
> # first bad commit: [c203c6d5b3f0597a15137b6394fa715542df78c8] migrate_pages: batch _unmap and _move


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BISECTED] first bad commit is c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
  2023-02-01 23:28   ` Huang, Ying
@ 2023-02-02 23:17     ` Huang, Ying
  2023-02-03 14:17       ` Hyeonggon Yoo
  0 siblings, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2023-02-02 23:17 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, akpm, Zi Yan, Yang Shi, Baolin Wang, Oscar Salvador,
	Matthew Wilcox, Bharata B Rao, Alistair Popple, haoxin,
	Minchan Kim

"Huang, Ying" <ying.huang@intel.com> writes:

> Hi, Hyeonggon,
>
> Hyeonggon Yoo <42.hyeyoo@gmail.com> writes:
>
>> On Wed, Feb 01, 2023 at 01:09:10AM +0900, Hyeonggon Yoo wrote:
>>> I've observed random list_del corruption on mm-unstable,
>>> where HEAD is commit d69862e693c069f4
>>> ("mm/migrate: convert putback_movable_pages() to use folios").
>>> 
>>> The issue can be easily reproduced by stressing MM multiple times:
>>> 	stress-ng --bigheap 0 --timeout 300
>>> 
>>> The compiler is gcc 12.2.1 and config, dmesg are included as attachment.
>>> I will try to bisect but can't promise quick resolution :)
>>
>>
>> The first bad commits appears to be:
>> c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
>>
>> the first bad commit _probably_ be earlier, but this is quite
>> easy to reproduce so at this point I think above is the real bad commit.
>
> Thank you very much for reporting the bug.  I'm in travel now but I will
> try to find some time to reproduce and debug it.

Still haven't reproduced the issue.  But after reviewing the code, I
found a bug in the code, which may cause list corruption.  Can you try
the debug patch below?

Best Regards,
Huang, Ying

-------------------------------8<-------------------------------------
From a4eef847fe4f6e50b6c3f69651c1dfdeb4b23bc4 Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Fri, 3 Feb 2023 07:12:24 +0800
Subject: [PATCH] dbg: fix list corruption for -EAGAIN

---
 mm/migrate.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 143d96775b4d..4205a0297ef8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1230,11 +1230,11 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
 
 	rc = move_to_new_folio(dst, src, mode);
 
-	if (rc != -EAGAIN)
+	if (rc != -EAGAIN) {
 		list_del(&dst->lru);
-
-	if (unlikely(!is_lru))
-		goto out_unlock_both;
+		if (unlikely(!is_lru))
+			goto out_unlock_both;
+	}
 
 	/*
 	 * When successful, push dst to LRU immediately: so that if it
-- 
2.35.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [BISECTED] first bad commit is c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
  2023-02-02 23:17     ` Huang, Ying
@ 2023-02-03 14:17       ` Hyeonggon Yoo
  2023-02-03 15:02         ` Huang, Ying
  0 siblings, 1 reply; 8+ messages in thread
From: Hyeonggon Yoo @ 2023-02-03 14:17 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Zi Yan, Yang Shi, Baolin Wang, Oscar Salvador,
	Matthew Wilcox, Bharata B Rao, Alistair Popple, haoxin,
	Minchan Kim

On Fri, Feb 03, 2023 at 07:17:14AM +0800, Huang, Ying wrote:
> "Huang, Ying" <ying.huang@intel.com> writes:
> 
> > Hi, Hyeonggon,
> >
> > Hyeonggon Yoo <42.hyeyoo@gmail.com> writes:
> >
> >> On Wed, Feb 01, 2023 at 01:09:10AM +0900, Hyeonggon Yoo wrote:
> >>> I've observed random list_del corruption on mm-unstable,
> >>> where HEAD is commit d69862e693c069f4
> >>> ("mm/migrate: convert putback_movable_pages() to use folios").
> >>> 
> >>> The issue can be easily reproduced by stressing MM multiple times:
> >>> 	stress-ng --bigheap 0 --timeout 300
> >>> 
> >>> The compiler is gcc 12.2.1 and config, dmesg are included as attachment.
> >>> I will try to bisect but can't promise quick resolution :)
> >>
> >>
> >> The first bad commits appears to be:
> >> c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
> >>
> >> the first bad commit _probably_ be earlier, but this is quite
> >> easy to reproduce so at this point I think above is the real bad commit.
> >
> > Thank you very much for reporting the bug.  I'm in travel now but I will
> > try to find some time to reproduce and debug it.
> 
> Still haven't reproduced the issue.  But after reviewing the code, I
> found a bug in the code, which may cause list corruption.  Can you try
> the debug patch below?

Unfortunately my home server seems to be broken again :(
That means I only have access to VMs and not a real machine now.

FYI it was not reproduced on KVM but reproduced on real machine.
Could you try checking on your machine with the config I attached? [1]

Sorry to bother your travel!

[1] https://marc.info/?l=linux-mm&m=167518135116956

Thanks,
Hyeonggon

> Best Regards,
> Huang, Ying
> 
> -------------------------------8<-------------------------------------
> From a4eef847fe4f6e50b6c3f69651c1dfdeb4b23bc4 Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@intel.com>
> Date: Fri, 3 Feb 2023 07:12:24 +0800
> Subject: [PATCH] dbg: fix list corruption for -EAGAIN
> 
> ---
>  mm/migrate.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 143d96775b4d..4205a0297ef8 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1230,11 +1230,11 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
>  
>  	rc = move_to_new_folio(dst, src, mode);
>  
> -	if (rc != -EAGAIN)
> +	if (rc != -EAGAIN) {
>  		list_del(&dst->lru);
> -
> -	if (unlikely(!is_lru))
> -		goto out_unlock_both;
> +		if (unlikely(!is_lru))
> +			goto out_unlock_both;
> +	}
>  
>  	/*
>  	 * When successful, push dst to LRU immediately: so that if it
> -- 
> 2.35.1
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BISECTED] first bad commit is c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
  2023-02-03 14:17       ` Hyeonggon Yoo
@ 2023-02-03 15:02         ` Huang, Ying
  2023-02-05 14:38           ` Hyeonggon Yoo
  0 siblings, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2023-02-03 15:02 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, akpm, Zi Yan, Yang Shi, Baolin Wang, Oscar Salvador,
	Matthew Wilcox, Bharata B Rao, Alistair Popple, haoxin,
	Minchan Kim

Hyeonggon Yoo <42.hyeyoo@gmail.com> writes:

> On Fri, Feb 03, 2023 at 07:17:14AM +0800, Huang, Ying wrote:
>> "Huang, Ying" <ying.huang@intel.com> writes:
>> 
>> > Hi, Hyeonggon,
>> >
>> > Hyeonggon Yoo <42.hyeyoo@gmail.com> writes:
>> >
>> >> On Wed, Feb 01, 2023 at 01:09:10AM +0900, Hyeonggon Yoo wrote:
>> >>> I've observed random list_del corruption on mm-unstable,
>> >>> where HEAD is commit d69862e693c069f4
>> >>> ("mm/migrate: convert putback_movable_pages() to use folios").
>> >>> 
>> >>> The issue can be easily reproduced by stressing MM multiple times:
>> >>> 	stress-ng --bigheap 0 --timeout 300
>> >>> 
>> >>> The compiler is gcc 12.2.1 and config, dmesg are included as attachment.
>> >>> I will try to bisect but can't promise quick resolution :)
>> >>
>> >>
>> >> The first bad commits appears to be:
>> >> c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
>> >>
>> >> the first bad commit _probably_ be earlier, but this is quite
>> >> easy to reproduce so at this point I think above is the real bad commit.
>> >
>> > Thank you very much for reporting the bug.  I'm in travel now but I will
>> > try to find some time to reproduce and debug it.
>> 
>> Still haven't reproduced the issue.  But after reviewing the code, I
>> found a bug in the code, which may cause list corruption.  Can you try
>> the debug patch below?
>
> Unfortunately my home server seems to be broken again :(
> That means I only have access to VMs and not a real machine now.
>
> FYI it was not reproduced on KVM but reproduced on real machine.
> Could you try checking on your machine with the config I attached? [1]

Thank you very much for testing!

> Sorry to bother your travel!

Never mind.  Your report helps me very much!

> [1] https://marc.info/?l=linux-mm&m=167518135116956

I have reproduced the bug successfully!  And I can reproduce the bug
with the previous debug patch too, although the reproduction rate isn't
high.

And in my test, the following patch can fix the bug.

It appears that zswap code will touch dst->lru during moving page.

Best Regards,
Huang, Ying

-------------------------8<----------------------------------
From b2e3f4aab16d8af0033286fde669b46e7467c7ec Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Fri, 3 Feb 2023 22:03:24 +0800
Subject: [PATCH] dbg,migrate_pages: restore destination folio state before
 move

---
 mm/migrate.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 143d96775b4d..fa7212330cb6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1225,13 +1225,19 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
 	int page_was_mapped = 0;
 	struct anon_vma *anon_vma = NULL;
 	bool is_lru = !__PageMovable(&src->page);
+	struct list_head *prev;
 
 	__migrate_folio_extract(dst, &page_was_mapped, &anon_vma);
+	prev = dst->lru.prev;
+	list_del(&dst->lru);
 
 	rc = move_to_new_folio(dst, src, mode);
 
-	if (rc != -EAGAIN)
-		list_del(&dst->lru);
+	if (rc == -EAGAIN) {
+		list_add(&dst->lru, prev);
+		__migrate_folio_record(dst, page_was_mapped, anon_vma);
+		return rc;
+	}
 
 	if (unlikely(!is_lru))
 		goto out_unlock_both;
@@ -1251,11 +1257,6 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
 			lru_add_drain();
 	}
 
-	if (rc == -EAGAIN) {
-		__migrate_folio_record(dst, page_was_mapped, anon_vma);
-		return rc;
-	}
-
 	if (page_was_mapped)
 		remove_migration_ptes(src,
 			rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
-- 
2.35.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [BISECTED] first bad commit is c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
  2023-02-03 15:02         ` Huang, Ying
@ 2023-02-05 14:38           ` Hyeonggon Yoo
  2023-02-06  6:25             ` Huang, Ying
  0 siblings, 1 reply; 8+ messages in thread
From: Hyeonggon Yoo @ 2023-02-05 14:38 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Zi Yan, Yang Shi, Baolin Wang, Oscar Salvador,
	Matthew Wilcox, Bharata B Rao, Alistair Popple, haoxin,
	Minchan Kim

On Fri, Feb 03, 2023 at 11:02:46PM +0800, Huang, Ying wrote:
> Hyeonggon Yoo <42.hyeyoo@gmail.com> writes:
> 
> > On Fri, Feb 03, 2023 at 07:17:14AM +0800, Huang, Ying wrote:
> >> "Huang, Ying" <ying.huang@intel.com> writes:
> >> 
> >> > Hi, Hyeonggon,
> >> >
> >> > Hyeonggon Yoo <42.hyeyoo@gmail.com> writes:
> >> >
> >> >> On Wed, Feb 01, 2023 at 01:09:10AM +0900, Hyeonggon Yoo wrote:
> >> >>> I've observed random list_del corruption on mm-unstable,
> >> >>> where HEAD is commit d69862e693c069f4
> >> >>> ("mm/migrate: convert putback_movable_pages() to use folios").
> >> >>> 
> >> >>> The issue can be easily reproduced by stressing MM multiple times:
> >> >>> 	stress-ng --bigheap 0 --timeout 300
> >> >>> 
> >> >>> The compiler is gcc 12.2.1 and config, dmesg are included as attachment.
> >> >>> I will try to bisect but can't promise quick resolution :)
> >> >>
> >> >>
> >> >> The first bad commits appears to be:
> >> >> c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
> >> >>
> >> >> the first bad commit _probably_ be earlier, but this is quite
> >> >> easy to reproduce so at this point I think above is the real bad commit.
> >> >
> >> > Thank you very much for reporting the bug.  I'm in travel now but I will
> >> > try to find some time to reproduce and debug it.
> >> 
> >> Still haven't reproduced the issue.  But after reviewing the code, I
> >> found a bug in the code, which may cause list corruption.  Can you try
> >> the debug patch below?
> >
> > Unfortunately my home server seems to be broken again :(
> > That means I only have access to VMs and not a real machine now.
> >
> > FYI it was not reproduced on KVM but reproduced on real machine.
> > Could you try checking on your machine with the config I attached? [1]
> 
> Thank you very much for testing!
>
> > Sorry to bother your travel!
> 
> Never mind.  Your report helps me very much!
> 
> > [1] https://marc.info/?l=linux-mm&m=167518135116956
> 
> I have reproduced the bug successfully!  And I can reproduce the bug
> with the previous debug patch too, although the reproduction rate isn't
> high.
> 
> And in my test, the following patch can fix the bug.
> 
> It appears that zswap code will touch dst->lru during moving page.

After setting swap space I was able to reproduce even on VM.

> -------------------------8<----------------------------------
> From b2e3f4aab16d8af0033286fde669b46e7467c7ec Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@intel.com>
> Date: Fri, 3 Feb 2023 22:03:24 +0800
> Subject: [PATCH] dbg,migrate_pages: restore destination folio state before
>  move
> 
> ---
>  mm/migrate.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)


This fixes the bug on my test:

Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Thanks for such a quick fix!

> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 143d96775b4d..fa7212330cb6 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1225,13 +1225,19 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
>  	int page_was_mapped = 0;
>  	struct anon_vma *anon_vma = NULL;
>  	bool is_lru = !__PageMovable(&src->page);
> +	struct list_head *prev;
>  
>  	__migrate_folio_extract(dst, &page_was_mapped, &anon_vma);
> +	prev = dst->lru.prev;
> +	list_del(&dst->lru);

BTW may be silly questions,
 
- How can zswap touch dst->lru during moving page, is there no lock
  that prevents this to happen?

- Does this race (?) happen only during moving page?
  (I mean, why is it safe to perform list_del()/list_add() before and
  after moving page?)

>  
>  	rc = move_to_new_folio(dst, src, mode);
>  
> -	if (rc != -EAGAIN)
> -		list_del(&dst->lru);
> +	if (rc == -EAGAIN) {
> +		list_add(&dst->lru, prev);
> +		__migrate_folio_record(dst, page_was_mapped, anon_vma);
> +		return rc;
> +	}
>
>  
>  	if (unlikely(!is_lru))
>  		goto out_unlock_both;
> @@ -1251,11 +1257,6 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
>  			lru_add_drain();
>  	}
>  
> -	if (rc == -EAGAIN) {
> -		__migrate_folio_record(dst, page_was_mapped, anon_vma);
> -		return rc;
> -	}
> -
>  	if (page_was_mapped)
>  		remove_migration_ptes(src,
>  			rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
> -- 
> 2.35.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BISECTED] first bad commit is c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
  2023-02-05 14:38           ` Hyeonggon Yoo
@ 2023-02-06  6:25             ` Huang, Ying
  0 siblings, 0 replies; 8+ messages in thread
From: Huang, Ying @ 2023-02-06  6:25 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, akpm, Zi Yan, Yang Shi, Baolin Wang, Oscar Salvador,
	Matthew Wilcox, Bharata B Rao, Alistair Popple, haoxin,
	Minchan Kim

Hyeonggon Yoo <42.hyeyoo@gmail.com> writes:

> On Fri, Feb 03, 2023 at 11:02:46PM +0800, Huang, Ying wrote:
>> Hyeonggon Yoo <42.hyeyoo@gmail.com> writes:
>> 
>> > On Fri, Feb 03, 2023 at 07:17:14AM +0800, Huang, Ying wrote:
>> >> "Huang, Ying" <ying.huang@intel.com> writes:
>> >> 
>> >> > Hi, Hyeonggon,
>> >> >
>> >> > Hyeonggon Yoo <42.hyeyoo@gmail.com> writes:
>> >> >
>> >> >> On Wed, Feb 01, 2023 at 01:09:10AM +0900, Hyeonggon Yoo wrote:
>> >> >>> I've observed random list_del corruption on mm-unstable,
>> >> >>> where HEAD is commit d69862e693c069f4
>> >> >>> ("mm/migrate: convert putback_movable_pages() to use folios").
>> >> >>> 
>> >> >>> The issue can be easily reproduced by stressing MM multiple times:
>> >> >>> 	stress-ng --bigheap 0 --timeout 300
>> >> >>> 
>> >> >>> The compiler is gcc 12.2.1 and config, dmesg are included as attachment.
>> >> >>> I will try to bisect but can't promise quick resolution :)
>> >> >>
>> >> >>
>> >> >> The first bad commits appears to be:
>> >> >> c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move")
>> >> >>
>> >> >> the first bad commit _probably_ be earlier, but this is quite
>> >> >> easy to reproduce so at this point I think above is the real bad commit.
>> >> >
>> >> > Thank you very much for reporting the bug.  I'm in travel now but I will
>> >> > try to find some time to reproduce and debug it.
>> >> 
>> >> Still haven't reproduced the issue.  But after reviewing the code, I
>> >> found a bug in the code, which may cause list corruption.  Can you try
>> >> the debug patch below?
>> >
>> > Unfortunately my home server seems to be broken again :(
>> > That means I only have access to VMs and not a real machine now.
>> >
>> > FYI it was not reproduced on KVM but reproduced on real machine.
>> > Could you try checking on your machine with the config I attached? [1]
>> 
>> Thank you very much for testing!
>>
>> > Sorry to bother your travel!
>> 
>> Never mind.  Your report helps me very much!
>> 
>> > [1] https://marc.info/?l=linux-mm&m=167518135116956
>> 
>> I have reproduced the bug successfully!  And I can reproduce the bug
>> with the previous debug patch too, although the reproduction rate isn't
>> high.
>> 
>> And in my test, the following patch can fix the bug.
>> 
>> It appears that zswap code will touch dst->lru during moving page.
>
> After setting swap space I was able to reproduce even on VM.
>
>> -------------------------8<----------------------------------
>> From b2e3f4aab16d8af0033286fde669b46e7467c7ec Mon Sep 17 00:00:00 2001
>> From: Huang Ying <ying.huang@intel.com>
>> Date: Fri, 3 Feb 2023 22:03:24 +0800
>> Subject: [PATCH] dbg,migrate_pages: restore destination folio state before
>>  move
>> 
>> ---
>>  mm/migrate.c | 15 ++++++++-------
>>  1 file changed, 8 insertions(+), 7 deletions(-)
>
>
> This fixes the bug on my test:
>
> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Thanks for such a quick fix!

Thank you very much!

>> 
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 143d96775b4d..fa7212330cb6 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1225,13 +1225,19 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
>>  	int page_was_mapped = 0;
>>  	struct anon_vma *anon_vma = NULL;
>>  	bool is_lru = !__PageMovable(&src->page);
>> +	struct list_head *prev;
>>  
>>  	__migrate_folio_extract(dst, &page_was_mapped, &anon_vma);
>> +	prev = dst->lru.prev;
>> +	list_del(&dst->lru);
>
> BTW may be silly questions,
>  
> - How can zswap touch dst->lru during moving page, is there no lock
>   that prevents this to happen?
>
> - Does this race (?) happen only during moving page?
>   (I mean, why is it safe to perform list_del()/list_add() before and
>   after moving page?)

This isn't a race condition.  In the following code path,

  __migrate_folio_move()
    move_to_new_folio()
      mops->migrate_page() // z3fold_page_migrate()
        list_add(&newpage->lru, &pool->lru)

newpage->lru will be changed during move_to_new_folio().  While the
original code assumes that newpage->lru will not be changed.

Best Regards,
Huang, Ying

>>  
>>  	rc = move_to_new_folio(dst, src, mode);
>>  
>> -	if (rc != -EAGAIN)
>> -		list_del(&dst->lru);
>> +	if (rc == -EAGAIN) {
>> +		list_add(&dst->lru, prev);
>> +		__migrate_folio_record(dst, page_was_mapped, anon_vma);
>> +		return rc;
>> +	}
>>
>>  
>>  	if (unlikely(!is_lru))
>>  		goto out_unlock_both;
>> @@ -1251,11 +1257,6 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
>>  			lru_add_drain();
>>  	}
>>  
>> -	if (rc == -EAGAIN) {
>> -		__migrate_folio_record(dst, page_was_mapped, anon_vma);
>> -		return rc;
>> -	}
>> -
>>  	if (page_was_mapped)
>>  		remove_migration_ptes(src,
>>  			rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
>> -- 
>> 2.35.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-02-06  6:26 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <Y9k9Jl9wIaUFZS30@hyeyoo>
2023-01-31 17:35 ` [BISECTED] first bad commit is c203c6d5b3f0597 ("migrate_pages: batch _unmap and _move") Hyeonggon Yoo
2023-01-31 22:45   ` Andrew Morton
2023-02-01 23:28   ` Huang, Ying
2023-02-02 23:17     ` Huang, Ying
2023-02-03 14:17       ` Hyeonggon Yoo
2023-02-03 15:02         ` Huang, Ying
2023-02-05 14:38           ` Hyeonggon Yoo
2023-02-06  6:25             ` Huang, Ying

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.