* Regression on todays tip/master (commit 16f70beccf43)
@ 2020-07-23 13:37 Joerg Roedel
2020-07-23 14:46 ` Thomas Gleixner
0 siblings, 1 reply; 8+ messages in thread
From: Joerg Roedel @ 2020-07-23 13:37 UTC (permalink / raw)
To: x86, Peter Zijlstra, Arnaldo Carvalho de Melo, Andy Lutomirski,
Dave Hansen
Cc: linux-kernel
Hi,
while testing the SEV-ES patches on today's tip/master I triggered the BUG
below:
[ 137.629660] ------------[ cut here ]------------
[ 137.630769] kernel BUG at kernel/signal.c:1917!
[ 137.631796] invalid opcode: 0000 [#1] SMP NOPTI
[ 137.632822] CPU: 3 PID: 28596 Comm: test_syscall_vd Not tainted 5.8.0-rc6-tip+ #3
[ 137.634495] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 137.636236] RIP: 0010:do_notify_parent+0x25c/0x290
[ 137.637311] Code: 41 89 c5 41 83 e5 01 45 31 c0 b9 01 00 00 00 48 8d 74 24 10 44 89 e7 48 8b 95 f0 04 00 00 e8 1b f5 ff ff e9 5a ff ff ff 0f 0b <0f> 0b 48 39 bf 18 05 00 00 75 17 48 8b 97 88 05 00 00 48 8d 87 88
[ 137.640453] RSP: 0018:ffffc13942197e10 EFLAGS: 00010002
[ 137.641246] RAX: 0000000000000008 RBX: ffff9cd98b5c5c40 RCX: 0000000000000040
[ 137.642329] RDX: ffff9cd99fa9dc40 RSI: 0000000000000011 RDI: ffff9cd98b5c5c40
[ 137.643397] RBP: ffff9cd98b5c5c40 R08: 0000000000000000 R09: 0000000000000000
[ 137.644467] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc13942197ea8
[ 137.645536] R13: ffff9cd98b5c6138 R14: 0000000000000001 R15: ffff9cd947de9ec0
[ 137.646621] FS: 0000000000000000(0000) GS:ffff9cd9baec0000(0000) knlGS:00000000f7c72700
[ 137.647833] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 137.648695] CR2: 00000000f7e74f24 CR3: 0000800043a0a000 CR4: 00000000003506e0
[ 137.649790] DR0: 0000000000406188 DR1: 000000000040130a DR2: 0000000000000000
[ 137.650861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 137.652055] Call Trace:
[ 137.652464] ? perf_iterate_sb+0x142/0x1e0
[ 137.653097] do_exit+0x991/0xaf0
[ 137.653610] ? ptrace_notify+0x4e/0x70
[ 137.654183] do_group_exit+0x3a/0xa0
[ 137.654731] __ia32_sys_exit_group+0x14/0x20
[ 137.655382] do_syscall_32_irqs_on+0x45/0x60
[ 137.656035] do_fast_syscall_32+0x67/0xe0
[ 137.656650] entry_SYSCALL_compat_after_hwframe+0x45/0x4d
[ 137.657466] RIP: 0023:0xf7fb5569
[ 137.657972] Code: Bad RIP value.
[ 137.658468] RSP: 002b:00000000ff9c0efc EFLAGS: 00200296 ORIG_RAX: 00000000000000fc
[ 137.659598] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000000000
[ 137.660667] RDX: 00000000ff9c0eec RSI: 00000000f7e5b6b8 RDI: 00000000f7e5b6b8
[ 137.661750] RBP: 00000000f7e5dc48 R08: 0000000000000000 R09: 0000000000000000
[ 137.662815] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 137.663882] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 137.664948] Modules linked in:
[ 137.665419] ---[ end trace ed97590b8bdea54b ]---
This is from a guest kernel which runs _without_ my SEV-ES patches, so
built from plain tip/master branch.
The guest had 4 VCPUs and ran 4 instances of the in-kernel x86-selftests
in a loop, together with 'perf top -e cycles:k'. As you can see in the
time-stamps, the issue triggered pretty quickly.
Please let me know if you need more information or testing from my side.
Thanks,
Joerg
* Re: Regression on todays tip/master (commit 16f70beccf43)
2020-07-23 13:37 Regression on todays tip/master (commit 16f70beccf43) Joerg Roedel
@ 2020-07-23 14:46 ` Thomas Gleixner
2020-07-23 14:52 ` Joerg Roedel
0 siblings, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2020-07-23 14:46 UTC (permalink / raw)
To: Joerg Roedel, x86, Peter Zijlstra, Arnaldo Carvalho de Melo,
Andy Lutomirski, Dave Hansen
Cc: linux-kernel
Joerg Roedel <joro@8bytes.org> writes:
> while testing the SEV-ES patches on today's tip/master I triggered the BUG
> below:
>
> [ 137.629660] ------------[ cut here ]------------
> [ 137.630769] kernel BUG at kernel/signal.c:1917!
> [ 137.631796] invalid opcode: 0000 [#1] SMP NOPTI
> [ 137.632822] CPU: 3 PID: 28596 Comm: test_syscall_vd Not tainted 5.8.0-rc6-tip+ #3
> [ 137.634495] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [ 137.636236] RIP: 0010:do_notify_parent+0x25c/0x290
> The guest had 4 VCPUs and ran 4 instances of the in-kernel x86-selftests
> in a loop, together with 'perf top -e cycles:k'. As you can see in the
> time-stamps, the issue triggered pretty quickly.
>
> Please let me know if you need more information or testing from my side.
Any chance to bisect this?
Thanks,
tglx
* Re: Regression on todays tip/master (commit 16f70beccf43)
2020-07-23 14:46 ` Thomas Gleixner
@ 2020-07-23 14:52 ` Joerg Roedel
2020-07-24 13:28 ` Ingo Molnar
0 siblings, 1 reply; 8+ messages in thread
From: Joerg Roedel @ 2020-07-23 14:52 UTC (permalink / raw)
To: Thomas Gleixner
Cc: x86, Peter Zijlstra, Arnaldo Carvalho de Melo, Andy Lutomirski,
Dave Hansen, linux-kernel
On Thu, Jul 23, 2020 at 04:46:04PM +0200, Thomas Gleixner wrote:
> Joerg Roedel <joro@8bytes.org> writes:
> > while testing the SEV-ES patches on today's tip/master I triggered the BUG
> > below:
> >
> > [ 137.629660] ------------[ cut here ]------------
> > [ 137.630769] kernel BUG at kernel/signal.c:1917!
> > [ 137.631796] invalid opcode: 0000 [#1] SMP NOPTI
> > [ 137.632822] CPU: 3 PID: 28596 Comm: test_syscall_vd Not tainted 5.8.0-rc6-tip+ #3
> > [ 137.634495] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > [ 137.636236] RIP: 0010:do_notify_parent+0x25c/0x290
> > The guest had 4 VCPUs and ran 4 instances of the in-kernel x86-selftests
> > in a loop, together with 'perf top -e cycles:k'. As you can see in the
> > time-stamps, the issue triggered pretty quickly.
> >
> > Please let me know if you need more information or testing from my side.
>
> Any chance to bisect this?
Yes, will try. I am currently testing plain -rc6, it seems to be fine.
Bisecting is next.
* Re: Regression on todays tip/master (commit 16f70beccf43)
2020-07-23 14:52 ` Joerg Roedel
@ 2020-07-24 13:28 ` Ingo Molnar
2020-07-24 14:50 ` Joerg Roedel
2020-07-25 10:38 ` Ingo Molnar
0 siblings, 2 replies; 8+ messages in thread
From: Ingo Molnar @ 2020-07-24 13:28 UTC (permalink / raw)
To: Joerg Roedel
Cc: Thomas Gleixner, x86, Peter Zijlstra, Arnaldo Carvalho de Melo,
Andy Lutomirski, Dave Hansen, linux-kernel
* Joerg Roedel <joro@8bytes.org> wrote:
> On Thu, Jul 23, 2020 at 04:46:04PM +0200, Thomas Gleixner wrote:
> > Joerg Roedel <joro@8bytes.org> writes:
> > > while testing the SEV-ES patches on today's tip/master I triggered the BUG
> > > below:
> > >
> > > [ 137.629660] ------------[ cut here ]------------
> > > [ 137.630769] kernel BUG at kernel/signal.c:1917!
> > > [ 137.631796] invalid opcode: 0000 [#1] SMP NOPTI
> > > [ 137.632822] CPU: 3 PID: 28596 Comm: test_syscall_vd Not tainted 5.8.0-rc6-tip+ #3
> > > [ 137.634495] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > > [ 137.636236] RIP: 0010:do_notify_parent+0x25c/0x290
> > > The guest had 4 VCPUs and ran 4 instances of the in-kernel x86-selftests
> > > in a loop, together with 'perf top -e cycles:k'. As you can see in the
> > > time-stamps, the issue triggered pretty quickly.
> > >
> > > Please let me know if you need more information or testing from my side.
> >
> > Any chance to bisect this?
>
> Yes, will try. I am currently testing plain -rc6, it seems to be fine.
> Bisecting is next.
Given that you are perf stress-testing the box, some recent perf
commit would be the primary suspect - before doing a full bisect you
might want to try current perf/core (2ac5413e5edc) and its upstream
base: v5.8-rc3, to narrow it down.
But in principle any other commit could be the cause as well, the
assert suggests memory corruption - I don't think we changed anything
in the signal code.
Thanks,
Ingo
* Re: Regression on todays tip/master (commit 16f70beccf43)
2020-07-24 13:28 ` Ingo Molnar
@ 2020-07-24 14:50 ` Joerg Roedel
2020-07-24 15:35 ` Joerg Roedel
2020-07-25 10:38 ` Ingo Molnar
1 sibling, 1 reply; 8+ messages in thread
From: Joerg Roedel @ 2020-07-24 14:50 UTC (permalink / raw)
To: Ingo Molnar
Cc: Thomas Gleixner, x86, Peter Zijlstra, Arnaldo Carvalho de Melo,
Andy Lutomirski, Dave Hansen, linux-kernel
On Fri, Jul 24, 2020 at 03:28:02PM +0200, Ingo Molnar wrote:
> Given that you are perf stress-testing the box, some recent perf
> commit would be the primary suspect - before doing a full bisect you
> might want to try current perf/core (2ac5413e5edc) and its upstream
> base: v5.8-rc3, to narrow it down.
>
> But in principle any other commit could be the cause as well, the
> assert suggests memory corruption - I don't think we changed anything
> in the signal code.
I tried to bisect, but it hasn't yielded anything useful yet. The outcome
was commit
commit 1abdfe706a579a702799fce465bceb9fb01d407c
Author: Alex Belits <abelits@marvell.com>
Date: Thu Jun 25 18:34:41 2020 -0400
lib: Restrict cpumask_local_spread to houskeeping CPUs
But it looks totally unrelated to the backtrace I am seeing, and
reverting it didn't fix the problem.
Next thing is, I can reliably reproduce it with yesterday's tip/master
(commit 16f70beccf43), but have not seen it with tip/master pulled today
(commit c02699cd25e8) yet.
To trigger it, it is sufficient to run the test_syscall_vdso_32 self-test
in a loop, ideally in $times parallel instances, where $times > `nproc`.
It usually triggers within the first 5 minutes in my test VMs. It turned
out that perf does not need to be running to trigger it.
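For illustration, the reproduction recipe above could be scripted roughly as follows. This is only a sketch under assumptions, not the script used for the report: the function name `run_instances` and its arguments are made up here, and the selftest is assumed to be already built.

```sh
#!/bin/sh
# run_instances BIN COUNT SECONDS
# Hypothetical helper: starts COUNT background loops, each re-running BIN
# back to back until SECONDS have elapsed, then waits for all of them.
run_instances() {
    bin=$1
    count=$2
    secs=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        (
            # Each instance loops independently until its deadline.
            end=$(( $(date +%s) + secs ))
            while [ "$(date +%s)" -lt "$end" ]; do
                "$bin" >/dev/null 2>&1
            done
        ) &
        i=$(( i + 1 ))
    done
    # Block until every background loop has finished.
    wait
}
```

Assuming the binary was built in the kernel tree under tools/testing/selftests/x86/, one more instance than there are CPUs could be run for five minutes with `run_instances ./tools/testing/selftests/x86/test_syscall_vdso_32 "$(( $(nproc) + 1 ))" 300`.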
Regards,
Joerg
* Re: Regression on todays tip/master (commit 16f70beccf43)
2020-07-24 14:50 ` Joerg Roedel
@ 2020-07-24 15:35 ` Joerg Roedel
0 siblings, 0 replies; 8+ messages in thread
From: Joerg Roedel @ 2020-07-24 15:35 UTC (permalink / raw)
To: Ingo Molnar
Cc: Thomas Gleixner, x86, Peter Zijlstra, Arnaldo Carvalho de Melo,
Andy Lutomirski, Dave Hansen, linux-kernel
On Fri, Jul 24, 2020 at 04:50:53PM +0200, Joerg Roedel wrote:
> Next thing is, I can reliably reproduce it with yesterday's tip/master
> (commit 16f70beccf43), but did not see it with tip/master pulled today
> (commit c02699cd25e8) yet.
Next bisection try ended with this log:
# bad: [16f70beccf43f9eb481ff8485974bc00ab2267d8] Merge branch 'core/debugobjects'
# good: [ba47d845d715a010f7b51f6f89bae32845e6acb7] Linux 5.8-rc6
git bisect start '16f70beccf43' 'v5.8-rc6'
# good: [e9c8c19545f31ec817507ee11884356799f14917] Merge branch 'perf/core'
git bisect good e9c8c19545f31ec817507ee11884356799f14917
# bad: [8e62572d2c29e1c484f5d4e7c016793209de2efb] Merge branch 'linus'
git bisect bad 8e62572d2c29e1c484f5d4e7c016793209de2efb
# good: [c085fb8774671e83f6199a8e838fbc0e57094029] perf/x86/intel/lbr: Support XSAVES for arch LBR read
git bisect good c085fb8774671e83f6199a8e838fbc0e57094029
# good: [220dbf4aaa5b574f67ce23fa4d7b0104515bc60e] Merge branch 'WIP.core/headers'
git bisect good 220dbf4aaa5b574f67ce23fa4d7b0104515bc60e
# bad: [9d246053a69196c7c27068870e9b4b66ac536f68] sched: Add a tracepoint to track rq->nr_running
git bisect bad 9d246053a69196c7c27068870e9b4b66ac536f68
# bad: [46609ce227039fd192e0ecc7d940bed587fd2c78] sched/uclamp: Protect uclamp fast path code with static key
git bisect bad 46609ce227039fd192e0ecc7d940bed587fd2c78
# bad: [85c2ce9104eb93517db2037699471c517e81f9b4] sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9
git bisect bad 85c2ce9104eb93517db2037699471c517e81f9b4
# bad: [faa2fd7cbad4609d06d7904c0a80cf2f8cd23678] Merge branch 'sched/urgent'
git bisect bad faa2fd7cbad4609d06d7904c0a80cf2f8cd23678
# first bad commit: [faa2fd7cbad4609d06d7904c0a80cf2f8cd23678] Merge branch 'sched/urgent'
So it ends at a merge commit. I am not going to debug this further as it
is no longer reproducible with today's tip/master. It probably was just a
bug in a conflict resolution when merging branches or something like
that.
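When a bisection ends at a merge commit like this, one way to look further is to list the merge's parents and view its combined diff, which shows the hunks where the merge result differs from all parents (typically hand-resolved conflicts). A minimal sketch, assuming it is run inside the tip tree; the helper names `merge_parents` and `merge_resolution` are hypothetical.

```sh
#!/bin/sh
# merge_parents COMMIT
# Print the parent hashes of COMMIT, one per line (two for a normal merge).
merge_parents() {
    git show -s --format=%P "$1" | tr ' ' '\n'
}

# merge_resolution COMMIT
# Show the combined diff of a merge commit.
merge_resolution() {
    git show --cc "$1"
}
```

For example, `merge_parents faa2fd7cbad4` would print the commits that went into the merge, each of which could then be tested on its own, and `merge_resolution faa2fd7cbad4` would show what the merge itself changed.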
Regards,
Joerg
* Re: Regression on todays tip/master (commit 16f70beccf43)
2020-07-24 13:28 ` Ingo Molnar
2020-07-24 14:50 ` Joerg Roedel
@ 2020-07-25 10:38 ` Ingo Molnar
2020-07-25 18:56 ` Joerg Roedel
1 sibling, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2020-07-25 10:38 UTC (permalink / raw)
To: Joerg Roedel
Cc: Thomas Gleixner, x86, Peter Zijlstra, Arnaldo Carvalho de Melo,
Andy Lutomirski, Dave Hansen, linux-kernel
* Ingo Molnar <mingo@kernel.org> wrote:
>
> * Joerg Roedel <joro@8bytes.org> wrote:
>
> > On Thu, Jul 23, 2020 at 04:46:04PM +0200, Thomas Gleixner wrote:
> > > Joerg Roedel <joro@8bytes.org> writes:
> > > > while testing the SEV-ES patches on today's tip/master I triggered the BUG
> > > > below:
> > > >
> > > > [ 137.629660] ------------[ cut here ]------------
> > > > [ 137.630769] kernel BUG at kernel/signal.c:1917!
> > > > [ 137.631796] invalid opcode: 0000 [#1] SMP NOPTI
> > > > [ 137.632822] CPU: 3 PID: 28596 Comm: test_syscall_vd Not tainted 5.8.0-rc6-tip+ #3
> > > > [ 137.634495] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > > > [ 137.636236] RIP: 0010:do_notify_parent+0x25c/0x290
> > > > The guest had 4 VCPUs and ran 4 instances of the in-kernel x86-selftests
> > > > in a loop, together with 'perf top -e cycles:k'. As you can see in the
> > > > time-stamps, the issue triggered pretty quickly.
> > > >
> > > > Please let me know if you need more information or testing from my side.
> > >
> > > Any chance to bisect this?
> >
> > Yes, will try. I am currently testing plain -rc6, it seems to be fine.
> > Bisecting is next.
>
> Given that you are perf stress-testing the box, some recent perf
> commit would be the primary suspect - before doing a full bisect you
> might want to try current perf/core (2ac5413e5edc) and its upstream
> base: v5.8-rc3, to narrow it down.
>
> But in principle any other commit could be the cause as well, the
> assert suggests memory corruption - I don't think we changed anything
> in the signal code.
On second thought, I think this recent bug might have been the
culprit:
d136122f5845: ("sched: Fix race against ptrace_freeze_trace()")
Fixed in tip:sched/urgent - this is why it went away in your testing
perhaps?
I'm sending this fix to Linus today.
Thanks,
Ingo
* Re: Regression on todays tip/master (commit 16f70beccf43)
2020-07-25 10:38 ` Ingo Molnar
@ 2020-07-25 18:56 ` Joerg Roedel
0 siblings, 0 replies; 8+ messages in thread
From: Joerg Roedel @ 2020-07-25 18:56 UTC (permalink / raw)
To: Ingo Molnar
Cc: Thomas Gleixner, x86, Peter Zijlstra, Arnaldo Carvalho de Melo,
Andy Lutomirski, Dave Hansen, linux-kernel
On Sat, Jul 25, 2020 at 12:38:50PM +0200, Ingo Molnar wrote:
> On second thought, I think this recent bug might have been the
> culprit:
>
> d136122f5845: ("sched: Fix race against ptrace_freeze_trace()")
>
> Fixed in tip:sched/urgent - this is why it went away in your testing
> perhaps?
Indeed, tip/master with this commit reverted triggers the issue again,
so it appears to be the same problem. But that leaves the question of why
I couldn't trigger it with plain v5.8-rc6.
Thanks,
Joerg