* Regression on todays tip/master (commit 16f70beccf43)
@ 2020-07-23 13:37 Joerg Roedel
  2020-07-23 14:46 ` Thomas Gleixner
  0 siblings, 1 reply; 8+ messages in thread
From: Joerg Roedel @ 2020-07-23 13:37 UTC (permalink / raw)
  To: x86, Peter Zijlstra, Arnaldo Carvalho de Melo, Andy Lutomirski,
	Dave Hansen
  Cc: linux-kernel

Hi,

while testing the SEV-ES patches on today's tip/master I triggered the BUG
below:

[  137.629660] ------------[ cut here ]------------
[  137.630769] kernel BUG at kernel/signal.c:1917!
[  137.631796] invalid opcode: 0000 [#1] SMP NOPTI
[  137.632822] CPU: 3 PID: 28596 Comm: test_syscall_vd Not tainted 5.8.0-rc6-tip+ #3
[  137.634495] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[  137.636236] RIP: 0010:do_notify_parent+0x25c/0x290
[  137.637311] Code: 41 89 c5 41 83 e5 01 45 31 c0 b9 01 00 00 00 48 8d 74 24 10 44 89 e7 48 8b 95 f0 04 00 00 e8 1b f5 ff ff e9 5a ff ff ff 0f 0b <0f> 0b 48 39 bf 18 05 00 00 75 17 48 8b 97 88 05 00 00 48 8d 87 88
[  137.640453] RSP: 0018:ffffc13942197e10 EFLAGS: 00010002
[  137.641246] RAX: 0000000000000008 RBX: ffff9cd98b5c5c40 RCX: 0000000000000040
[  137.642329] RDX: ffff9cd99fa9dc40 RSI: 0000000000000011 RDI: ffff9cd98b5c5c40
[  137.643397] RBP: ffff9cd98b5c5c40 R08: 0000000000000000 R09: 0000000000000000
[  137.644467] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc13942197ea8
[  137.645536] R13: ffff9cd98b5c6138 R14: 0000000000000001 R15: ffff9cd947de9ec0
[  137.646621] FS:  0000000000000000(0000) GS:ffff9cd9baec0000(0000) knlGS:00000000f7c72700
[  137.647833] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[  137.648695] CR2: 00000000f7e74f24 CR3: 0000800043a0a000 CR4: 00000000003506e0
[  137.649790] DR0: 0000000000406188 DR1: 000000000040130a DR2: 0000000000000000
[  137.650861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[  137.652055] Call Trace:
[  137.652464]  ? perf_iterate_sb+0x142/0x1e0
[  137.653097]  do_exit+0x991/0xaf0
[  137.653610]  ? ptrace_notify+0x4e/0x70
[  137.654183]  do_group_exit+0x3a/0xa0
[  137.654731]  __ia32_sys_exit_group+0x14/0x20
[  137.655382]  do_syscall_32_irqs_on+0x45/0x60
[  137.656035]  do_fast_syscall_32+0x67/0xe0
[  137.656650]  entry_SYSCALL_compat_after_hwframe+0x45/0x4d
[  137.657466] RIP: 0023:0xf7fb5569
[  137.657972] Code: Bad RIP value.
[  137.658468] RSP: 002b:00000000ff9c0efc EFLAGS: 00200296 ORIG_RAX: 00000000000000fc
[  137.659598] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000000000
[  137.660667] RDX: 00000000ff9c0eec RSI: 00000000f7e5b6b8 RDI: 00000000f7e5b6b8
[  137.661750] RBP: 00000000f7e5dc48 R08: 0000000000000000 R09: 0000000000000000
[  137.662815] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  137.663882] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  137.664948] Modules linked in:
[  137.665419] ---[ end trace ed97590b8bdea54b ]---

This is from a guest kernel that runs _without_ my SEV-ES patches, i.e. it
was built from the plain tip/master branch.

The guest had 4 VCPUs and ran 4 instances of the in-kernel x86-selftests
in a loop, together with 'perf top -e cycles:k'. As you can see in the
time-stamps, the issue triggered pretty quickly.

Please let me know if you need more information or testing from my side.

Thanks,

	Joerg
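
A common first step with a trace like the one above is to turn the
`RIP:` symbol+offset into a source line. The following is a minimal
sketch, not something taken from this thread: `rip_symbol` is a helper
name made up for illustration, and the follow-up `faddr2line` call
assumes a kernel tree with a `vmlinux` built with debug info.

```shell
#!/bin/sh
# Hypothetical helper: read oops text on stdin and print the
# "symbol+offset/size" part of every kernel-side "RIP:" line.
# Lines whose RIP is a raw user-space address print nothing.
rip_symbol() {
    sed -n 's/.*RIP: [0-9a-f]*:\([A-Za-z_][A-Za-z0-9_.]*+0x[0-9a-f]*\/0x[0-9a-f]*\).*/\1/p'
}
```

For example, `dmesg | rip_symbol` on the report above would yield
`do_notify_parent+0x25c/0x290`, which could then be passed to
`./scripts/faddr2line vmlinux do_notify_parent+0x25c/0x290` from the
top of the kernel source tree.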


* Re: Regression on todays tip/master (commit 16f70beccf43)
  2020-07-23 13:37 Regression on todays tip/master (commit 16f70beccf43) Joerg Roedel
@ 2020-07-23 14:46 ` Thomas Gleixner
  2020-07-23 14:52   ` Joerg Roedel
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2020-07-23 14:46 UTC (permalink / raw)
  To: Joerg Roedel, x86, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Andy Lutomirski, Dave Hansen
  Cc: linux-kernel

Joerg Roedel <joro@8bytes.org> writes:
> while testing the SEV-ES patches on today's tip/master I triggered the BUG
> below:
>
> [  137.629660] ------------[ cut here ]------------
> [  137.630769] kernel BUG at kernel/signal.c:1917!
> [  137.631796] invalid opcode: 0000 [#1] SMP NOPTI
> [  137.632822] CPU: 3 PID: 28596 Comm: test_syscall_vd Not tainted 5.8.0-rc6-tip+ #3
> [  137.634495] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [  137.636236] RIP: 0010:do_notify_parent+0x25c/0x290
> The guest had 4 VCPUs and ran 4 instances of the in-kernel x86-selftests
> in a loop, together with 'perf top -e cycles:k'. As you can see in the
> time-stamps, the issue triggered pretty quickly.
>
> Please let me know if you need more information or testing from my side.

Any chance to bisect this?

Thanks,

        tglx


* Re: Regression on todays tip/master (commit 16f70beccf43)
  2020-07-23 14:46 ` Thomas Gleixner
@ 2020-07-23 14:52   ` Joerg Roedel
  2020-07-24 13:28     ` Ingo Molnar
  0 siblings, 1 reply; 8+ messages in thread
From: Joerg Roedel @ 2020-07-23 14:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: x86, Peter Zijlstra, Arnaldo Carvalho de Melo, Andy Lutomirski,
	Dave Hansen, linux-kernel

On Thu, Jul 23, 2020 at 04:46:04PM +0200, Thomas Gleixner wrote:
> Joerg Roedel <joro@8bytes.org> writes:
> > while testing the SEV-ES patches on today's tip/master I triggered the BUG
> > below:
> >
> > [  137.629660] ------------[ cut here ]------------
> > [  137.630769] kernel BUG at kernel/signal.c:1917!
> > [  137.631796] invalid opcode: 0000 [#1] SMP NOPTI
> > [  137.632822] CPU: 3 PID: 28596 Comm: test_syscall_vd Not tainted 5.8.0-rc6-tip+ #3
> > [  137.634495] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > [  137.636236] RIP: 0010:do_notify_parent+0x25c/0x290
> > The guest had 4 VCPUs and ran 4 instances of the in-kernel x86-selftests
> > in a loop, together with 'perf top -e cycles:k'. As you can see in the
> > time-stamps, the issue triggered pretty quickly.
> >
> > Please let me know if you need more information or testing from my side.
> 
> Any chance to bisect this?

Yes, will try. I am currently testing plain -rc6, which seems to be fine.
Bisecting is next.



* Re: Regression on todays tip/master (commit 16f70beccf43)
  2020-07-23 14:52   ` Joerg Roedel
@ 2020-07-24 13:28     ` Ingo Molnar
  2020-07-24 14:50       ` Joerg Roedel
  2020-07-25 10:38       ` Ingo Molnar
  0 siblings, 2 replies; 8+ messages in thread
From: Ingo Molnar @ 2020-07-24 13:28 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, x86, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Andy Lutomirski, Dave Hansen, linux-kernel


* Joerg Roedel <joro@8bytes.org> wrote:

> On Thu, Jul 23, 2020 at 04:46:04PM +0200, Thomas Gleixner wrote:
> > Joerg Roedel <joro@8bytes.org> writes:
> > > while testing the SEV-ES patches on today's tip/master I triggered the BUG
> > > below:
> > >
> > > [  137.629660] ------------[ cut here ]------------
> > > [  137.630769] kernel BUG at kernel/signal.c:1917!
> > > [  137.631796] invalid opcode: 0000 [#1] SMP NOPTI
> > > [  137.632822] CPU: 3 PID: 28596 Comm: test_syscall_vd Not tainted 5.8.0-rc6-tip+ #3
> > > [  137.634495] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > > [  137.636236] RIP: 0010:do_notify_parent+0x25c/0x290
> > > The guest had 4 VCPUs and ran 4 instances of the in-kernel x86-selftests
> > > in a loop, together with 'perf top -e cycles:k'. As you can see in the
> > > time-stamps, the issue triggered pretty quickly.
> > >
> > > Please let me know if you need more information or testing from my side.
> > 
> > Any chance to bisect this?
> 
> Yes, will try. I am currently testing plain -rc6, it seems to be fine.
> Bisecting is next.

Given that you are perf stress-testing the box, some recent perf 
commit would be the primary suspect - before doing a full bisect you 
might want to try current perf/core (2ac5413e5edc) and its upstream 
base: v5.8-rc3, to narrow it down.

But in principle any other commit could be the cause as well, the 
assert suggests memory corruption - I don't think we changed anything 
in the signal code.

Thanks,

	Ingo


* Re: Regression on todays tip/master (commit 16f70beccf43)
  2020-07-24 13:28     ` Ingo Molnar
@ 2020-07-24 14:50       ` Joerg Roedel
  2020-07-24 15:35         ` Joerg Roedel
  2020-07-25 10:38       ` Ingo Molnar
  1 sibling, 1 reply; 8+ messages in thread
From: Joerg Roedel @ 2020-07-24 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, x86, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Andy Lutomirski, Dave Hansen, linux-kernel

On Fri, Jul 24, 2020 at 03:28:02PM +0200, Ingo Molnar wrote:
> Given that you are perf stress-testing the box, some recent perf 
> commit would be the primary suspect - before doing a full bisect you 
> might want to try current perf/core (2ac5413e5edc) and its upstream 
> base: v5.8-rc3, to narrow it down.
> 
> But in principle any other commit could be the cause as well, the 
> assert suggests memory corruption - I don't think we changed anything 
> in the signal code.

I tried to bisect, but it hasn't yielded anything useful yet. The outcome
was commit

	commit 1abdfe706a579a702799fce465bceb9fb01d407c
	Author: Alex Belits <abelits@marvell.com>
	Date:   Thu Jun 25 18:34:41 2020 -0400

	    lib: Restrict cpumask_local_spread to houskeeping CPUs

But it looks totally unrelated to the backtrace I am seeing, and
reverting it didn't fix the problem.

Next thing is, I can reliably reproduce it with yesterday's tip/master
(commit 16f70beccf43), but have not yet seen it with tip/master pulled today
(commit c02699cd25e8).

To trigger it, it is sufficient to run the test_syscall_vdso_32 self-test in
a loop, ideally in $times parallel instances, where $times > `nproc`. It
usually triggers within the first 5 minutes in my test VMs. It turned out
that a running perf is not needed to trigger it.

Regards,

	Joerg


* Re: Regression on todays tip/master (commit 16f70beccf43)
  2020-07-24 14:50       ` Joerg Roedel
@ 2020-07-24 15:35         ` Joerg Roedel
  0 siblings, 0 replies; 8+ messages in thread
From: Joerg Roedel @ 2020-07-24 15:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, x86, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Andy Lutomirski, Dave Hansen, linux-kernel

On Fri, Jul 24, 2020 at 04:50:53PM +0200, Joerg Roedel wrote:
> Next thing is, I can reliably reproduce it with yesterday's tip/master
> (commit 16f70beccf43), but have not yet seen it with tip/master pulled today
> (commit c02699cd25e8).

Next bisection try ended with this log:

# bad: [16f70beccf43f9eb481ff8485974bc00ab2267d8] Merge branch 'core/debugobjects'
# good: [ba47d845d715a010f7b51f6f89bae32845e6acb7] Linux 5.8-rc6
git bisect start '16f70beccf43' 'v5.8-rc6'
# good: [e9c8c19545f31ec817507ee11884356799f14917] Merge branch 'perf/core'
git bisect good e9c8c19545f31ec817507ee11884356799f14917
# bad: [8e62572d2c29e1c484f5d4e7c016793209de2efb] Merge branch 'linus'
git bisect bad 8e62572d2c29e1c484f5d4e7c016793209de2efb
# good: [c085fb8774671e83f6199a8e838fbc0e57094029] perf/x86/intel/lbr: Support XSAVES for arch LBR read
git bisect good c085fb8774671e83f6199a8e838fbc0e57094029
# good: [220dbf4aaa5b574f67ce23fa4d7b0104515bc60e] Merge branch 'WIP.core/headers'
git bisect good 220dbf4aaa5b574f67ce23fa4d7b0104515bc60e
# bad: [9d246053a69196c7c27068870e9b4b66ac536f68] sched: Add a tracepoint to track rq->nr_running
git bisect bad 9d246053a69196c7c27068870e9b4b66ac536f68
# bad: [46609ce227039fd192e0ecc7d940bed587fd2c78] sched/uclamp: Protect uclamp fast path code with static key
git bisect bad 46609ce227039fd192e0ecc7d940bed587fd2c78
# bad: [85c2ce9104eb93517db2037699471c517e81f9b4] sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9
git bisect bad 85c2ce9104eb93517db2037699471c517e81f9b4
# bad: [faa2fd7cbad4609d06d7904c0a80cf2f8cd23678] Merge branch 'sched/urgent'
git bisect bad faa2fd7cbad4609d06d7904c0a80cf2f8cd23678
# first bad commit: [faa2fd7cbad4609d06d7904c0a80cf2f8cd23678] Merge branch 'sched/urgent'

So it ends at a merge commit. I am not going to debug this further as it
is no longer reproducible with today's tip/master. It was probably just a
bug in a conflict resolution when merging branches or something like
that.

Regards,

	Joerg
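
A bisection like the one logged above can in principle be automated
with `git bisect run`, which treats exit status 0 as good, 125 as
"cannot test, skip" and other statuses from 1 to 127 as bad. The
sketch below shows only that mapping. `BUILD_CMD`, `TEST_CMD`,
`bisect_step` and the guest script name are hypothetical placeholders,
not anything used in this thread.

```shell
#!/bin/sh
# Sketch of a step script for `git bisect run`.
# BUILD_CMD and TEST_CMD are hypothetical hooks; TEST_CMD is assumed
# to boot a guest with the new kernel, run the reproducer and fail
# if the BUG shows up in the guest's kernel log.
BUILD_CMD="${BUILD_CMD:-make -j$(nproc)}"
TEST_CMD="${TEST_CMD:-./boot-guest-and-run-selftest.sh}"

# Map the build and test results to `git bisect run` exit codes:
#   0   = good
#   125 = this commit cannot be tested (e.g. build failure), skip it
#   1   = bad
bisect_step() {
    sh -c "$BUILD_CMD" >/dev/null 2>&1 || return 125
    sh -c "$TEST_CMD" >/dev/null 2>&1 || return 1
    return 0
}
```

To use it, the script would end with a call to `bisect_step` as its
last command and be started with `git bisect start 16f70beccf43
v5.8-rc6` followed by `git bisect run` and the script path. Because
the failure here is intermittent, a test window that is too short can
mark a bad commit as good and send the bisection in the wrong
direction.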


* Re: Regression on todays tip/master (commit 16f70beccf43)
  2020-07-24 13:28     ` Ingo Molnar
  2020-07-24 14:50       ` Joerg Roedel
@ 2020-07-25 10:38       ` Ingo Molnar
  2020-07-25 18:56         ` Joerg Roedel
  1 sibling, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2020-07-25 10:38 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, x86, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Andy Lutomirski, Dave Hansen, linux-kernel


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Joerg Roedel <joro@8bytes.org> wrote:
> 
> > On Thu, Jul 23, 2020 at 04:46:04PM +0200, Thomas Gleixner wrote:
> > > Joerg Roedel <joro@8bytes.org> writes:
> > > > while testing the SEV-ES patches on today's tip/master I triggered the BUG
> > > > below:
> > > >
> > > > [  137.629660] ------------[ cut here ]------------
> > > > [  137.630769] kernel BUG at kernel/signal.c:1917!
> > > > [  137.631796] invalid opcode: 0000 [#1] SMP NOPTI
> > > > [  137.632822] CPU: 3 PID: 28596 Comm: test_syscall_vd Not tainted 5.8.0-rc6-tip+ #3
> > > > [  137.634495] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > > > [  137.636236] RIP: 0010:do_notify_parent+0x25c/0x290
> > > > The guest had 4 VCPUs and ran 4 instances of the in-kernel x86-selftests
> > > > in a loop, together with 'perf top -e cycles:k'. As you can see in the
> > > > time-stamps, the issue triggered pretty quickly.
> > > >
> > > > Please let me know if you need more information or testing from my side.
> > > 
> > > Any chance to bisect this?
> > 
> > Yes, will try. I am currently testing plain -rc6, it seems to be fine.
> > Bisecting is next.
> 
> Given that you are perf stress-testing the box, some recent perf 
> commit would be the primary suspect - before doing a full bisect you 
> might want to try current perf/core (2ac5413e5edc) and its upstream 
> base: v5.8-rc3, to narrow it down.
> 
> But in principle any other commit could be the cause as well, the 
> assert suggests memory corruption - I don't think we changed anything 
> in the signal code.

On second thought, I think the recent bug fixed by this commit might have
been the culprit:

  d136122f5845: ("sched: Fix race against ptrace_freeze_traced()")

The fix is in tip:sched/urgent - perhaps this is why it went away in your
testing?

I'm sending this fix to Linus today.

Thanks,

	Ingo


* Re: Regression on todays tip/master (commit 16f70beccf43)
  2020-07-25 10:38       ` Ingo Molnar
@ 2020-07-25 18:56         ` Joerg Roedel
  0 siblings, 0 replies; 8+ messages in thread
From: Joerg Roedel @ 2020-07-25 18:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, x86, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Andy Lutomirski, Dave Hansen, linux-kernel

On Sat, Jul 25, 2020 at 12:38:50PM +0200, Ingo Molnar wrote:
> On second thought, I think the recent bug fixed by this commit might have
> been the culprit:
> 
>   d136122f5845: ("sched: Fix race against ptrace_freeze_traced()")
> 
> The fix is in tip:sched/urgent - perhaps this is why it went away in your
> testing?

Indeed, tip/master with this commit reverted triggers the issue again,
so it appears to be the same problem. But that leaves open the question of
why I couldn't trigger it with plain v5.8-rc6.

Thanks,

	Joerg

