* Alpha: rare random memory corruption/segfault in user space bisected
@ 2022-05-06 21:21 Michael Cree
  2022-05-07  1:56 ` Hillf Danton
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Cree @ 2022-05-06 21:21 UTC
  To: linux-alpha; +Cc: linux-kernel, Joonsoo Kim, Andrew Morton

The Alpha kernel has been exhibiting rare and random memory
corruptions/segfaults in user space since the 5.9.y kernel.  First seen
on the Debian Ports build daemon when running a 5.10.y kernel, resulting
in occasional (one or two a day) build failures with gcc ICEs either
due to self-detected corrupt memory structures or segfaults.  Have been
running a 5.8.y kernel without such problems for over six months.

Tried bisecting last year but went off track with incorrect good/bad
determinations due to the rare nature of the bug.  After trying a 5.16.y
kernel early this year and seeing the bug is still present, retried the
bisection and have got to:

aae466b0052e1888edd1d7f473d4310d64936196 is the first bad commit
commit aae466b0052e1888edd1d7f473d4310d64936196
Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date:   Tue Aug 11 18:30:50 2020 -0700

    mm/swap: implement workingset detection for anonymous LRU


Pretty confident this is the bad commit as the kernel built at the parent
commit (3852f6768ede54...) has not failed in four days of running.  Have
always seen the failure within one day of running in the past.

Cheers,
Michael.
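
A rough way to judge how much weight a clean run carries in a bisection
like the one above: the sketch below assumes (this is not stated in the
report) that each day of running is an independent trial with a fixed
failure probability, and computes the chance that a bad kernel survives
a given number of days without failing. The function name
prob_clean_run and the example probability are illustrative only.

```c
/*
 * Probability that a kernel carrying the bug shows no failure for `days`
 * consecutive days.  ASSUMPTION (not from the report): each day is an
 * independent trial that fails with probability p_fail_per_day.
 */
double prob_clean_run(double p_fail_per_day, int days)
{
    double p_clean = 1.0;

    for (int i = 0; i < days; i++)
        p_clean *= 1.0 - p_fail_per_day;

    return p_clean;
}
```

Under this model, a bad kernel that fails on half of all days would
still pass a four-day run with probability 0.5^4 = 0.0625, so a "good"
verdict on a rare bug is never fully certain.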


* Re: Alpha: rare random memory corruption/segfault in user space bisected
  2022-05-06 21:21 Alpha: rare random memory corruption/segfault in user space bisected Michael Cree
@ 2022-05-07  1:56 ` Hillf Danton
  2022-05-07 18:27   ` Yu Zhao
  0 siblings, 1 reply; 7+ messages in thread
From: Hillf Danton @ 2022-05-07  1:56 UTC
  To: Michael Cree; +Cc: linux-mm, linux-kernel

On Sat, 7 May 2022 09:21:25 +1200 Michael Cree wrote:
> Alpha kernel has been exhibiting rare and random memory
> corruptions/segfaults in user space since the 5.9.y kernel.  First seen
> on the Debian Ports build daemon when running 5.10.y kernel resulting
> in the occasional (one or two a day) build failures with gcc ICEs either
> due to self detected corrupt memory structures or segfaults.  Have been
> running 5.8.y kernel without such problems for over six months.
> 
> Tried bisecting last year but went off track with incorrect good/bad
> determinations due to rare nature of bug.  After trying a 5.16.y kernel
> early this year and seen the bug is still present retried the bisection
> and have got to:
> 
> aae466b0052e1888edd1d7f473d4310d64936196 is the first bad commit
> commit aae466b0052e1888edd1d7f473d4310d64936196
> Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date:   Tue Aug 11 18:30:50 2020 -0700
> 
>     mm/swap: implement workingset detection for anonymous LRU
> 
> 
> Pretty confident this is the bad commit as the kernel built to the parent
> commit (3852f6768ede54...) has not failed in four days running. Always have
> seen the failure within one day of running in past.

See whether the fix for the syzbot bisection [1] also cures your issue.

[1] https://lore.kernel.org/lkml/000000000000625fa705dd1802e3@google.com/



* Re: Alpha: rare random memory corruption/segfault in user space bisected
  2022-05-07  1:56 ` Hillf Danton
@ 2022-05-07 18:27   ` Yu Zhao
  2022-05-11 20:36     ` Michael Cree
  0 siblings, 1 reply; 7+ messages in thread
From: Yu Zhao @ 2022-05-07 18:27 UTC
  To: Michael Cree; +Cc: Linux-MM, linux-kernel, Hillf Danton, Joonsoo Kim

On Fri, May 6, 2022 at 6:57 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Sat, 7 May 2022 09:21:25 +1200 Michael Cree wrote:
> > Alpha kernel has been exhibiting rare and random memory
> > corruptions/segfaults in user space since the 5.9.y kernel.  First seen
> > on the Debian Ports build daemon when running 5.10.y kernel resulting
> > in the occasional (one or two a day) build failures with gcc ICEs either
> > due to self detected corrupt memory structures or segfaults.  Have been
> > running 5.8.y kernel without such problems for over six months.
> >
> > Tried bisecting last year but went off track with incorrect good/bad
> > determinations due to rare nature of bug.  After trying a 5.16.y kernel
> > early this year and seen the bug is still present retried the bisection
> > and have got to:
> >
> > aae466b0052e1888edd1d7f473d4310d64936196 is the first bad commit
> > commit aae466b0052e1888edd1d7f473d4310d64936196
> > Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Date:   Tue Aug 11 18:30:50 2020 -0700
> >
> >     mm/swap: implement workingset detection for anonymous LRU

This commit seems innocent to me. While not ruling out anything, e.g.,
this commit, compiler, qemu, userspace itself, etc., my wild guess is
the problem is memory barrier related. Two lock/unlock pairs, which
imply two full barriers, were removed. This is no small matter on
Alpha, since its memory model is very weakly ordered, AFAIK.

Can you please try the attached patch on top of this commit? Thanks!
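
As a userspace illustration of the ordering concern described above,
the sketch below uses C11 atomic_thread_fence(memory_order_seq_cst) as
a stand-in for the full barrier that smp_mb() provides in the kernel.
It is not kernel code and does not model the swap cache; the mailbox
type and the function names are made up for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical example type: plain data announced by an atomic flag. */
struct mailbox {
    int payload;        /* written before the flag is set */
    atomic_bool ready;  /* tells the reader the payload is valid */
};

/* Writer: store the data, then a full fence, then raise the flag. */
static void mailbox_publish(struct mailbox *mb, int value)
{
    mb->payload = value;
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&mb->ready, true, memory_order_relaxed);
}

/* Reader: check the flag, then a full fence, then read the data. */
static bool mailbox_consume(struct mailbox *mb, int *out)
{
    if (!atomic_load_explicit(&mb->ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_seq_cst);
    *out = mb->payload;
    return true;
}
```

With both fences in place, a reader that observes the flag is
guaranteed to observe the payload written before it. Without them, a
weakly ordered CPU is free to make the two accesses visible in the
other order, which is the kind of effect the test patch probes for.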

> > Pretty confident this is the bad commit as the kernel built to the parent
> > commit (3852f6768ede54...) has not failed in four days running. Always have
> > seen the failure within one day of running in past.
>
> See if the fix to the syzbot bisection [1] is not a cure to your issue.
>
> [1] https://lore.kernel.org/lkml/000000000000625fa705dd1802e3@google.com/

[-- Attachment #2: test.diff --]
[-- Type: application/octet-stream, Size: 653 bytes --]

diff --git a/mm/memory.c b/mm/memory.c
index de311fc7639e..f1cf07416cf4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3150,6 +3150,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 					goto out_page;
 				}
 
+				smp_mb();
+
 				shadow = get_shadow_from_swap_cache(entry);
 				if (shadow)
 					workingset_refault(page, shadow);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b73aabdfd35a..310d4049cdf3 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -499,6 +499,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		goto fail_unlock;
 	}
 
+	smp_mb();
+
 	if (shadow)
 		workingset_refault(page, shadow);
 


* Re: Alpha: rare random memory corruption/segfault in user space bisected
  2022-05-07 18:27   ` Yu Zhao
@ 2022-05-11 20:36     ` Michael Cree
  2022-05-23 20:56       ` Yu Zhao
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Cree @ 2022-05-11 20:36 UTC
  To: Yu Zhao; +Cc: Linux-MM, linux-kernel, Hillf Danton, Joonsoo Kim

On Sat, May 07, 2022 at 11:27:15AM -0700, Yu Zhao wrote:
> On Fri, May 6, 2022 at 6:57 PM Hillf Danton <hdanton@sina.com> wrote:
> >
> > On Sat, 7 May 2022 09:21:25 +1200 Michael Cree wrote:
> > > Alpha kernel has been exhibiting rare and random memory
> > > corruptions/segfaults in user space since the 5.9.y kernel.  First seen
> > > on the Debian Ports build daemon when running 5.10.y kernel resulting
> > > in the occasional (one or two a day) build failures with gcc ICEs either
> > > due to self detected corrupt memory structures or segfaults.  Have been
> > > running 5.8.y kernel without such problems for over six months.
> > >
> > > Tried bisecting last year but went off track with incorrect good/bad
> > > determinations due to rare nature of bug.  After trying a 5.16.y kernel
> > > early this year and seen the bug is still present retried the bisection
> > > and have got to:
> > >
> > > aae466b0052e1888edd1d7f473d4310d64936196 is the first bad commit
> > > commit aae466b0052e1888edd1d7f473d4310d64936196
> > > Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > Date:   Tue Aug 11 18:30:50 2020 -0700
> > >
> > >     mm/swap: implement workingset detection for anonymous LRU
> 
> This commit seems innocent to me. While not ruling out anything, i.e.,
> this commit, compiler, qemu, userspace itself, etc., my wild guess is
> the problem is memory barrier related. Two lock/unlock pairs, which
> imply two full barriers, were removed. This is not a small deal on
> Alpha, since it imposes no constraints on cache coherency, AFAIK.
> 
> Can you please try the attached patch on top of this commit? Thanks!

Thanks, I have that running now for a day without any problem showing
up, but that's not long enough to be sure it has fixed the problem. Will
get back to you after another day or two of testing.

Cheers,
Michael.


* Re: Alpha: rare random memory corruption/segfault in user space bisected
  2022-05-11 20:36     ` Michael Cree
@ 2022-05-23 20:56       ` Yu Zhao
  2022-05-30  8:25         ` Michael Cree
  0 siblings, 1 reply; 7+ messages in thread
From: Yu Zhao @ 2022-05-23 20:56 UTC
  To: Michael Cree; +Cc: Linux-MM, linux-kernel, Hillf Danton, Joonsoo Kim

On Wed, May 11, 2022 at 2:37 PM Michael Cree <mcree@orcon.net.nz> wrote:
>
> On Sat, May 07, 2022 at 11:27:15AM -0700, Yu Zhao wrote:
> > On Fri, May 6, 2022 at 6:57 PM Hillf Danton <hdanton@sina.com> wrote:
> > >
> > > On Sat, 7 May 2022 09:21:25 +1200 Michael Cree wrote:
> > > > Alpha kernel has been exhibiting rare and random memory
> > > > corruptions/segfaults in user space since the 5.9.y kernel.  First seen
> > > > on the Debian Ports build daemon when running 5.10.y kernel resulting
> > > > in the occasional (one or two a day) build failures with gcc ICEs either
> > > > due to self detected corrupt memory structures or segfaults.  Have been
> > > > running 5.8.y kernel without such problems for over six months.
> > > >
> > > > Tried bisecting last year but went off track with incorrect good/bad
> > > > determinations due to rare nature of bug.  After trying a 5.16.y kernel
> > > > early this year and seen the bug is still present retried the bisection
> > > > and have got to:
> > > >
> > > > aae466b0052e1888edd1d7f473d4310d64936196 is the first bad commit
> > > > commit aae466b0052e1888edd1d7f473d4310d64936196
> > > > Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > Date:   Tue Aug 11 18:30:50 2020 -0700
> > > >
> > > >     mm/swap: implement workingset detection for anonymous LRU
> >
> > This commit seems innocent to me. While not ruling out anything, i.e.,
> > this commit, compiler, qemu, userspace itself, etc., my wild guess is
> > the problem is memory barrier related. Two lock/unlock pairs, which
> > imply two full barriers, were removed. This is not a small deal on
> > Alpha, since it imposes no constraints on cache coherency, AFAIK.
> >
> > Can you please try the attached patch on top of this commit? Thanks!
>
> Thanks, I have that running now for a day without any problem showing
> up, but that's not long enough to be sure it has fixed the problem. Will
> get back to you after another day or two of testing.

Any luck? Thanks!


* Re: Alpha: rare random memory corruption/segfault in user space bisected
  2022-05-23 20:56       ` Yu Zhao
@ 2022-05-30  8:25         ` Michael Cree
  2022-06-08  0:20           ` Yu Zhao
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Cree @ 2022-05-30  8:25 UTC
  To: Yu Zhao; +Cc: Linux-MM, linux-kernel, Hillf Danton, Joonsoo Kim

On Mon, May 23, 2022 at 02:56:12PM -0600, Yu Zhao wrote:
> On Wed, May 11, 2022 at 2:37 PM Michael Cree <mcree@orcon.net.nz> wrote:
> >
> > On Sat, May 07, 2022 at 11:27:15AM -0700, Yu Zhao wrote:
> > > On Fri, May 6, 2022 at 6:57 PM Hillf Danton <hdanton@sina.com> wrote:
> > > >
> > > > On Sat, 7 May 2022 09:21:25 +1200 Michael Cree wrote:
> > > > > Alpha kernel has been exhibiting rare and random memory
> > > > > corruptions/segfaults in user space since the 5.9.y kernel.  First seen
> > > > > on the Debian Ports build daemon when running 5.10.y kernel resulting
> > > > > in the occasional (one or two a day) build failures with gcc ICEs either
> > > > > due to self detected corrupt memory structures or segfaults.  Have been
> > > > > running 5.8.y kernel without such problems for over six months.
> > > > >
> > > > > Tried bisecting last year but went off track with incorrect good/bad
> > > > > determinations due to rare nature of bug.  After trying a 5.16.y kernel
> > > > > early this year and seen the bug is still present retried the bisection
> > > > > and have got to:
> > > > >
> > > > > aae466b0052e1888edd1d7f473d4310d64936196 is the first bad commit
> > > > > commit aae466b0052e1888edd1d7f473d4310d64936196
> > > > > Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > > Date:   Tue Aug 11 18:30:50 2020 -0700
> > > > >
> > > > >     mm/swap: implement workingset detection for anonymous LRU
> > >
> > > This commit seems innocent to me. While not ruling out anything, i.e.,
> > > this commit, compiler, qemu, userspace itself, etc., my wild guess is
> > > the problem is memory barrier related. Two lock/unlock pairs, which
> > > imply two full barriers, were removed. This is not a small deal on
> > > Alpha, since it imposes no constraints on cache coherency, AFAIK.
> > >
> > > Can you please try the attached patch on top of this commit? Thanks!
> >
> > Thanks, I have that running now for a day without any problem showing
> > up, but that's not long enough to be sure it has fixed the problem. Will
> > get back to you after another day or two of testing.
> 
> Any luck? Thanks!

Sorry for the delay in replying.  Testing has taken longer due to an
unexpected hitch.  The patch proved to be good, but as a double check I
retested the above commit without the patch and it now won't fail, which
calls into question whether aae466b0052e188 is truly the bad commit. I
have gone back to the prior bad commit in the bisection (25788738eb9c)
and it failed again, confirming it is bad.  So it looks like the first
bad commit is somewhere between aae466b0052e188 and 25788738eb9c (a
total of five commits inclusive, four if we take aae466b0052e188 as
good) and I am now building 471e78cc7687337abd1 and will test that.

Cheers,
Michael.
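
With four or five candidate commits left, the remaining bisection
effort is small. A minimal sketch of the count, assuming each tested
kernel gives a reliable good/bad verdict (which this thread shows is
not guaranteed for a rare bug); bisect_steps_remaining is a
hypothetical helper, not part of git:

```c
/*
 * Worst-case number of kernels that still have to be built and tested to
 * pin down the first bad commit among `candidates` commits, when each
 * test halves the remaining range.  ASSUMPTION: every verdict is correct.
 */
int bisect_steps_remaining(int candidates)
{
    int steps = 0;

    if (candidates <= 1)
        return 0;

    while ((1UL << steps) < (unsigned long)candidates)
        steps++;

    return steps;
}
```

For four candidates this gives two more test kernels, and for five it
gives three.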


* Re: Alpha: rare random memory corruption/segfault in user space bisected
  2022-05-30  8:25         ` Michael Cree
@ 2022-06-08  0:20           ` Yu Zhao
  0 siblings, 0 replies; 7+ messages in thread
From: Yu Zhao @ 2022-06-08  0:20 UTC
  To: Michael Cree; +Cc: Linux-MM, linux-kernel, Hillf Danton, Joonsoo Kim

On Mon, May 30, 2022 at 2:25 AM Michael Cree <mcree@orcon.net.nz> wrote:
>
> On Mon, May 23, 2022 at 02:56:12PM -0600, Yu Zhao wrote:
> > On Wed, May 11, 2022 at 2:37 PM Michael Cree <mcree@orcon.net.nz> wrote:
> > >
> > > On Sat, May 07, 2022 at 11:27:15AM -0700, Yu Zhao wrote:
> > > > On Fri, May 6, 2022 at 6:57 PM Hillf Danton <hdanton@sina.com> wrote:
> > > > >
> > > > > On Sat, 7 May 2022 09:21:25 +1200 Michael Cree wrote:
> > > > > > Alpha kernel has been exhibiting rare and random memory
> > > > > > corruptions/segfaults in user space since the 5.9.y kernel.  First seen
> > > > > > on the Debian Ports build daemon when running 5.10.y kernel resulting
> > > > > > in the occasional (one or two a day) build failures with gcc ICEs either
> > > > > > due to self detected corrupt memory structures or segfaults.  Have been
> > > > > > running 5.8.y kernel without such problems for over six months.
> > > > > >
> > > > > > Tried bisecting last year but went off track with incorrect good/bad
> > > > > > determinations due to rare nature of bug.  After trying a 5.16.y kernel
> > > > > > early this year and seen the bug is still present retried the bisection
> > > > > > and have got to:
> > > > > >
> > > > > > aae466b0052e1888edd1d7f473d4310d64936196 is the first bad commit
> > > > > > commit aae466b0052e1888edd1d7f473d4310d64936196
> > > > > > Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > > > Date:   Tue Aug 11 18:30:50 2020 -0700
> > > > > >
> > > > > >     mm/swap: implement workingset detection for anonymous LRU
> > > >
> > > > This commit seems innocent to me. While not ruling out anything, i.e.,
> > > > this commit, compiler, qemu, userspace itself, etc., my wild guess is
> > > > the problem is memory barrier related. Two lock/unlock pairs, which
> > > > imply two full barriers, were removed. This is not a small deal on
> > > > Alpha, since it imposes no constraints on cache coherency, AFAIK.
> > > >
> > > > Can you please try the attached patch on top of this commit? Thanks!
> > >
> > > Thanks, I have that running now for a day without any problem showing
> > > up, but that's not long enough to be sure it has fixed the problem. Will
> > > get back to you after another day or two of testing.
> >
> > Any luck? Thanks!
>
> Sorry for the delay in replying.  Testing has taken longer due to an
> unexpected hitch.  The patch proved to be good but for a double check I
> retested the above commit without the patch but it now won't fail which
> calls into question whether aae466b0052e188 is truly the bad commit. I
> have gone back to the prior bad commit in the bisection (25788738eb9c)
> and it failed again confirming it is bad.  So it looks like the first
> bad commit is somewhere between aae466b0052e188 and 25788738eb9c (a
> total of five commits inclusive, four if we take aae466b0052e188 as
> good) and I am now building 471e78cc7687337abd1 and will test that.

No worries. Thanks for the update.

Were swap devices used when the ICEs happened? If so,
1) What kind of swap devices, e.g., zram, block device, etc.?
2) aae466b0052e188 might have made the kernel swap more frequently and
thus the problem easier to reproduce. Assuming this is the case, then
setting swappiness to 200 might help reproduce the problem.
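
For reference, swappiness is exposed through the vm.swappiness sysctl,
i.e. the file /proc/sys/vm/swappiness, and writing it needs root. The
sketch below is one way to set it from C; the helper name
set_swappiness is hypothetical, and the path is passed in as a
parameter so the helper can be tried on an ordinary file first.

```c
#include <stdio.h>

/*
 * Write a swappiness value to the given sysctl file.  Returns 0 on
 * success, -1 on a rejected value or an I/O error.  Values outside
 * 0..200 are refused here before touching the file.
 */
int set_swappiness(const char *path, int value)
{
    FILE *f;
    int ok;

    if (value < 0 || value > 200)
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

On the test machine this would be called as
set_swappiness("/proc/sys/vm/swappiness", 200) before starting the
build workload.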
