* PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression)
@ 2023-01-29  2:17 Nick Bowler
  2023-01-29 22:14 ` Peter Xu
  2023-01-30  9:37 ` Linux kernel regression tracking (#adding)
  0 siblings, 2 replies; 9+ messages in thread
From: Nick Bowler @ 2023-01-29  2:17 UTC
To: linux-kernel, sparclinux, regressions; +Cc: Peter Xu

Hi,

Starting with Linux 6.1.y, my sparc64 (Sun Ultra 60) system is very
unstable, with userspace processes randomly crashing with all kinds of
different weird errors.  The same problem occurs on 6.2-rc5.  Linux
6.0.y is OK.

Usually, it manifests with ssh connections just suddenly dropping out
like this:

  malloc(): unaligned tcache chunk detected
  Connection to alectrona closed.

but other kinds of failures (random segfaults, bus errors, etc.) are
seen too.

I have not ever seen the kernel itself oops or anything like that, there
are no abnormal kernel log messages of any kind; except for the normal
ones that get printed when processes segfault, like this one:

  [  563.085851] zsh[2073]: segfault at 10 ip 00000000f7a7c09c (rpc 00000000f7a7c0a0) sp 00000000ff8f5e08 error 1 in libc.so.6[f7960000+1b2000]

I was able to reproduce this fairly reliably by using GNU ddrescue to
dump a disk from the dvd drive -- things usually go awry after a minute
or two.  So I was able to bisect to this commit:

  2e3468778dbe3ec389a10c21a703bb8e5be5cfbc is the first bad commit
  commit 2e3468778dbe3ec389a10c21a703bb8e5be5cfbc
  Author: Peter Xu <peterx@redhat.com>
  Date:   Thu Aug 11 12:13:29 2022 -0400

      mm: remember young/dirty bit for page migrations

This does not revert cleanly on master, but I ran my test on the
immediately preceding commit (0ccf7f168e17: "mm/thp: carry over dirty
bit when thp splits on pmd") extra times and I am unable to get this
one to crash, so reasonably confident in this bisection result...

Let me know if you need any more info!

Thanks,
  Nick
* Re: PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression)
  2023-01-29  2:17 PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression) Nick Bowler
@ 2023-01-29 22:14 ` Peter Xu
  2023-01-30  1:36   ` Nick Bowler
  2023-01-31  1:46   ` Nick Bowler
  2023-01-30  9:37 ` Linux kernel regression tracking (#adding)
  1 sibling, 2 replies; 9+ messages in thread
From: Peter Xu @ 2023-01-29 22:14 UTC
To: Nick Bowler; +Cc: linux-kernel, sparclinux, regressions, Andrew Morton

On Sat, Jan 28, 2023 at 09:17:31PM -0500, Nick Bowler wrote:
> Hi,

Hi, Nick,

>
> Starting with Linux 6.1.y, my sparc64 (Sun Ultra 60) system is very
> unstable, with userspace processes randomly crashing with all kinds of
> different weird errors.  The same problem occurs on 6.2-rc5.  Linux
> 6.0.y is OK.
>
> Usually, it manifests with ssh connections just suddenly dropping out
> like this:
>
>   malloc(): unaligned tcache chunk detected
>   Connection to alectrona closed.
>
> but other kinds of failures (random segfaults, bus errors, etc.) are
> seen too.
>
> I have not ever seen the kernel itself oops or anything like that, there
> are no abnormal kernel log messages of any kind; except for the normal
> ones that get printed when processes segfault, like this one:
>
>   [  563.085851] zsh[2073]: segfault at 10 ip 00000000f7a7c09c (rpc
> 00000000f7a7c0a0) sp 00000000ff8f5e08 error 1 in
> libc.so.6[f7960000+1b2000]
>
> I was able to reproduce this fairly reliably by using GNU ddrescue to
> dump a disk from the dvd drive -- things usually go awry after a minute
> or two.  So I was able to bisect to this commit:
>
>   2e3468778dbe3ec389a10c21a703bb8e5be5cfbc is the first bad commit
>   commit 2e3468778dbe3ec389a10c21a703bb8e5be5cfbc
>   Author: Peter Xu <peterx@redhat.com>
>   Date:   Thu Aug 11 12:13:29 2022 -0400
>
>       mm: remember young/dirty bit for page migrations
>
> This does not revert cleanly on master, but I ran my test on the
> immediately preceding commit (0ccf7f168e17: "mm/thp: carry over dirty
> bit when thp splits on pmd") extra times and I am unable to get this
> one to crash, so reasonably confident in this bisection result...

There's a similar report previously but interestingly it was exactly
reported against commit 0ccf7f168e17, which was the one you reported all
good:

https://lore.kernel.org/all/20221021160603.GA23307@u164.east.ru/

It's probably because for some reason the thp split didn't really happen in
your system (maybe thp disabled?) or it should break too.

It also means 624a2c94f5b7a didn't really fix all the issues.  So I assumed
that's the only issue we had after verified with 624a2c94f5b7a on two
existing reproducers and we assumed all issues fixed.

However then with this report I looked into the whole set and I did notice
the page migration code actually has similar problem.  Sorry I should have
noticed this even earlier.  So very likely the previous two reports came
from environment where page migration is either rare or not enabled.  And
now I suspect your system has page migration enabled.

Could you try below patch to see whether it fixes your problem?  It should
cover the last piece of possible issue with dirty bit on sparc after that
patchset.  It's based on latest master branch (commit ab072681eabe1ce0).

---8<---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index abe6cfd92ffa..f15ea5b389f6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3272,15 +3272,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
-	if (is_writable_migration_entry(entry))
-		pmde = maybe_pmd_mkwrite(pmde, vma);
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
-		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
+		pmde = pmd_mkuffd_wp(pmde);
 	if (!is_migration_entry_young(entry))
 		pmde = pmd_mkold(pmde);
 	/* NOTE: this may contain setting soft-dirty on some archs */
 	if (PageDirty(new) && is_migration_entry_dirty(entry))
 		pmde = pmd_mkdirty(pmde);
+	if (is_writable_migration_entry(entry))
+		pmde = maybe_pmd_mkwrite(pmde, vma);
+	else
+		pmde = pmd_wrprotect(pmde);
 
 	if (PageAnon(new)) {
 		rmap_t rmap_flags = RMAP_COMPOUND;
diff --git a/mm/migrate.c b/mm/migrate.c
index a4d3fc65085f..cc5455614e01 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -224,6 +224,8 @@ static bool remove_migration_pte(struct folio *folio,
 			pte = maybe_mkwrite(pte, vma);
 		else if (pte_swp_uffd_wp(*pvmw.pte))
 			pte = pte_mkuffd_wp(pte);
+		else
+			pte = pte_wrprotect(pte);
 
 		if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
 			rmap_flags |= RMAP_EXCLUSIVE;
---8<---

Thanks,

--
Peter Xu
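The reordering in the patch above can be illustrated with a small toy model. This is only a sketch, not kernel code: it assumes an architecture on which marking an entry dirty also sets its write-enable bit, which is what the "dirty bit on sparc" remark in this thread appears to refer to. All `toy_` names, the `restore_*` helpers, and the bit values are hypothetical and do not match any real page-table layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define TOY_PTE_WRITE (1u << 0) /* hardware write-enable bit */
#define TOY_PTE_DIRTY (1u << 1) /* "page was modified" bit */

typedef uint32_t toy_pte_t;

/* Assumption: on the modelled arch, mkdirty also grants write access. */
static toy_pte_t toy_pte_mkdirty(toy_pte_t pte)
{
	return pte | TOY_PTE_DIRTY | TOY_PTE_WRITE;
}

static toy_pte_t toy_pte_mkwrite(toy_pte_t pte)
{
	return pte | TOY_PTE_WRITE;
}

static toy_pte_t toy_pte_wrprotect(toy_pte_t pte)
{
	return pte & ~TOY_PTE_WRITE;
}

static bool toy_pte_write(toy_pte_t pte)
{
	return (pte & TOY_PTE_WRITE) != 0;
}

static bool toy_pte_dirty(toy_pte_t pte)
{
	return (pte & TOY_PTE_DIRTY) != 0;
}

/* Pre-patch ordering: write permission is settled first, dirty applied last. */
static toy_pte_t restore_old_order(bool writable, bool dirty)
{
	toy_pte_t pte = 0;

	if (writable)
		pte = toy_pte_mkwrite(pte);
	if (dirty)
		pte = toy_pte_mkdirty(pte); /* may silently re-grant write */
	return pte;
}

/* Patched ordering: dirty first, then write permission decided explicitly. */
static toy_pte_t restore_new_order(bool writable, bool dirty)
{
	toy_pte_t pte = 0;

	if (dirty)
		pte = toy_pte_mkdirty(pte);
	if (writable)
		pte = toy_pte_mkwrite(pte);
	else
		pte = toy_pte_wrprotect(pte);

	/* A read-only entry must never come back writable. */
	assert(writable || !toy_pte_write(pte));
	return pte;
}
```

In this model, `restore_old_order(false, true)` yields an entry that is writable even though the migration entry was read-only, while `restore_new_order(false, true)` keeps the dirty bit and stays write-protected.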
* Re: PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression)
  2023-01-29 22:14 ` Peter Xu
@ 2023-01-30  1:36   ` Nick Bowler
  2023-01-31  1:46   ` Nick Bowler
  1 sibling, 0 replies; 9+ messages in thread
From: Nick Bowler @ 2023-01-30  1:36 UTC
To: Peter Xu; +Cc: linux-kernel, sparclinux, regressions, Andrew Morton

On 2023-01-29, Peter Xu <peterx@redhat.com> wrote:
> There's a similar report previously but interestingly it was exactly
> reported against commit 0ccf7f168e17, which was the one you reported all
> good:
>
> https://lore.kernel.org/all/20221021160603.GA23307@u164.east.ru/
>
> It's probably because for some reason the thp split didn't really happen in
> your system (maybe thp disabled?) or it should break too.

This seems an accurate assessment:

  CONFIG_TRANSPARENT_HUGEPAGE is not set

> It also means 624a2c94f5b7a didn't really fix all the issues.  So I assumed
> that's the only issue we had after verified with 624a2c94f5b7a on two
> existing reproducers and we assumed all issues fixed.
>
> However then with this report I looked into the whole set and I did notice
> the page migration code actually has similar problem.  Sorry I should have
> noticed this even earlier.  So very likely the previous two reports came
> from environment where page migration is either rare or not enabled.  And
> now I suspect your system has page migration enabled.

I'd say that sounds correct too: I have CONFIG_COMPACTION=y which sets
CONFIG_MIGRATION=y

> Could you try below patch to see whether it fixes your problem?  It should
> cover the last piece of possible issue with dirty bit on sparc after that
> patchset.  It's based on latest master branch (commit ab072681eabe1ce0).

I applied this on top of 6.2-rc6 and will give this a spin now.

Thanks,
  Nick
* Re: PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression)
  2023-01-29 22:14 ` Peter Xu
  2023-01-30  1:36   ` Nick Bowler
@ 2023-01-31  1:46   ` Nick Bowler
  2023-02-15 14:49     ` Linux regression tracking (Thorsten Leemhuis)
  1 sibling, 1 reply; 9+ messages in thread
From: Nick Bowler @ 2023-01-31  1:46 UTC
To: Peter Xu; +Cc: linux-kernel, sparclinux, regressions, Andrew Morton

On 2023-01-29, Peter Xu <peterx@redhat.com> wrote:
> On Sat, Jan 28, 2023 at 09:17:31PM -0500, Nick Bowler wrote:
>> Starting with Linux 6.1.y, my sparc64 (Sun Ultra 60) system is very
>> unstable, with userspace processes randomly crashing with all kinds of
>> different weird errors.  The same problem occurs on 6.2-rc5.  Linux
>> 6.0.y is OK.
[...]
> Could you try below patch to see whether it fixes your problem?  It should
> cover the last piece of possible issue with dirty bit on sparc after that
> patchset.  It's based on latest master branch (commit ab072681eabe1ce0).

Haven't seen any failures yet, so it seems this patch on top of 6.2-rc6
makes things much better.

I'll keep running this for a while to see if any other problems come up.

Thanks,
  Nick
* Re: PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression)
  2023-01-31  1:46   ` Nick Bowler
@ 2023-02-15 14:49     ` Linux regression tracking (Thorsten Leemhuis)
  2023-02-15 15:21       ` Peter Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-02-15 14:49 UTC
To: Nick Bowler, Peter Xu
Cc: linux-kernel, sparclinux, regressions, Andrew Morton

On 31.01.23 02:46, Nick Bowler wrote:
> On 2023-01-29, Peter Xu <peterx@redhat.com> wrote:
>> On Sat, Jan 28, 2023 at 09:17:31PM -0500, Nick Bowler wrote:
>>> Starting with Linux 6.1.y, my sparc64 (Sun Ultra 60) system is very
>>> unstable, with userspace processes randomly crashing with all kinds of
>>> different weird errors.  The same problem occurs on 6.2-rc5.  Linux
>>> 6.0.y is OK.
> [...]
>> Could you try below patch to see whether it fixes your problem?  It should
>> cover the last piece of possible issue with dirty bit on sparc after that
>> patchset.  It's based on latest master branch (commit ab072681eabe1ce0).
>
> Haven't seen any failures yet, so it seems this patch on top of 6.2-rc6
> makes things much better.
>
> I'll keep running this for a while to see if any other problems come up.

Nick, I assume no other problems showed up?

In that case Peter could send the patch in for merging. Or did you do
that already?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot ignore-activity
* Re: PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression)
  2023-02-15 14:49     ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-02-15 15:21       ` Peter Xu
  2023-02-16  5:32         ` Nick Bowler
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Xu @ 2023-02-15 15:21 UTC
To: Linux regressions mailing list
Cc: Nick Bowler, linux-kernel, sparclinux, Andrew Morton

On Wed, Feb 15, 2023 at 03:49:56PM +0100, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 31.01.23 02:46, Nick Bowler wrote:
> > On 2023-01-29, Peter Xu <peterx@redhat.com> wrote:
> >> On Sat, Jan 28, 2023 at 09:17:31PM -0500, Nick Bowler wrote:
> >>> Starting with Linux 6.1.y, my sparc64 (Sun Ultra 60) system is very
> >>> unstable, with userspace processes randomly crashing with all kinds of
> >>> different weird errors.  The same problem occurs on 6.2-rc5.  Linux
> >>> 6.0.y is OK.
> > [...]
> >> Could you try below patch to see whether it fixes your problem?  It should
> >> cover the last piece of possible issue with dirty bit on sparc after that
> >> patchset.  It's based on latest master branch (commit ab072681eabe1ce0).
> >
> > Haven't seen any failures yet, so it seems this patch on top of 6.2-rc6
> > makes things much better.
> >
> > I'll keep running this for a while to see if any other problems come up.
>
> Nick, I assume no other problems showed up?
>
> In that case Peter could send the patch in for merging. Or did you do
> that already?

Thanks for raising this again.  Nop, I'm just waiting for a final ack from
Nick to make sure that nothing went wrong after the longer run.

--
Peter Xu
* Re: PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression)
  2023-02-15 15:21       ` Peter Xu
@ 2023-02-16  5:32         ` Nick Bowler
  2023-02-16 15:33           ` Peter Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Nick Bowler @ 2023-02-16  5:32 UTC
To: Peter Xu
Cc: Linux regressions mailing list, linux-kernel, sparclinux, Andrew Morton

On 2023-02-15, Peter Xu <peterx@redhat.com> wrote:
> On Wed, Feb 15, 2023 at 03:49:56PM +0100, Linux regression tracking
> (Thorsten Leemhuis) wrote:
>> On 31.01.23 02:46, Nick Bowler wrote:
>> > I'll keep running this for a while to see if any other problems come
>> > up.
>>
>> Nick, I assume no other problems showed up?
>>
>> In that case Peter could send the patch in for merging. Or did you do
>> that already?
>
> Thanks for raising this again.  Nop, I'm just waiting for a final ack from
> Nick to make sure that nothing went wrong after the longer run.

Oh, yes, it wasn't so much a "run" as just continuing to use the
computer normally.

Everything seems stable enough.

Cheers,
  Nick
* Re: PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression)
  2023-02-16  5:32         ` Nick Bowler
@ 2023-02-16 15:33           ` Peter Xu
  0 siblings, 0 replies; 9+ messages in thread
From: Peter Xu @ 2023-02-16 15:33 UTC
To: Nick Bowler
Cc: Linux regressions mailing list, linux-kernel, sparclinux, Andrew Morton

On Thu, Feb 16, 2023 at 12:32:54AM -0500, Nick Bowler wrote:
> On 2023-02-15, Peter Xu <peterx@redhat.com> wrote:
> > On Wed, Feb 15, 2023 at 03:49:56PM +0100, Linux regression tracking
> > (Thorsten Leemhuis) wrote:
> >> On 31.01.23 02:46, Nick Bowler wrote:
> >> > I'll keep running this for a while to see if any other problems come
> >> > up.
> >>
> >> Nick, I assume no other problems showed up?
> >>
> >> In that case Peter could send the patch in for merging. Or did you do
> >> that already?
> >
> > Thanks for raising this again.  Nop, I'm just waiting for a final ack from
> > Nick to make sure that nothing went wrong after the longer run.
>
> Oh, yes, it wasn't so much a "run" as just continuing to use the
> computer normally.
>
> Everything seems stable enough.

Thanks Nick.  I've just posted a formal patch with you copied.

There's a slight tweak due to rebasing to the latest akpm tree, but I
still attached your tested-by for appreciations on the help, and I
assume it should have the same functional change.

--
Peter Xu
* Re: PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression)
  2023-01-29  2:17 PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression) Nick Bowler
  2023-01-29 22:14 ` Peter Xu
@ 2023-01-30  9:37 ` Linux kernel regression tracking (#adding)
  1 sibling, 0 replies; 9+ messages in thread
From: Linux kernel regression tracking (#adding) @ 2023-01-30  9:37 UTC
To: Nick Bowler, linux-kernel, sparclinux, regressions; +Cc: Peter Xu

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 29.01.23 03:17, Nick Bowler wrote:
>
> Starting with Linux 6.1.y, my sparc64 (Sun Ultra 60) system is very
> unstable, with userspace processes randomly crashing with all kinds of
> different weird errors.  The same problem occurs on 6.2-rc5.  Linux
> 6.0.y is OK.
>
> Usually, it manifests with ssh connections just suddenly dropping out
> like this:
>
>   malloc(): unaligned tcache chunk detected
>   Connection to alectrona closed.
>
> but other kinds of failures (random segfaults, bus errors, etc.) are
> seen too.
>
> I have not ever seen the kernel itself oops or anything like that, there
> are no abnormal kernel log messages of any kind; except for the normal
> ones that get printed when processes segfault, like this one:
>
>   [  563.085851] zsh[2073]: segfault at 10 ip 00000000f7a7c09c (rpc
> 00000000f7a7c0a0) sp 00000000ff8f5e08 error 1 in
> libc.so.6[f7960000+1b2000]
>
> I was able to reproduce this fairly reliably by using GNU ddrescue to
> dump a disk from the dvd drive -- things usually go awry after a minute
> or two.  So I was able to bisect to this commit:
>
>   2e3468778dbe3ec389a10c21a703bb8e5be5cfbc is the first bad commit
>   commit 2e3468778dbe3ec389a10c21a703bb8e5be5cfbc
>   Author: Peter Xu <peterx@redhat.com>
>   Date:   Thu Aug 11 12:13:29 2022 -0400
>
>       mm: remember young/dirty bit for page migrations
>
> This does not revert cleanly on master, but I ran my test on the
> immediately preceding commit (0ccf7f168e17: "mm/thp: carry over dirty
> bit when thp splits on pmd") extra times and I am unable to get this
> one to crash, so reasonably confident in this bisection result...
>
> Let me know if you need any more info!

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 2e3468778dbe3ec3
#regzbot title sparc64: random crashes
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.