linux-kernel.vger.kernel.org archive mirror
* perf: fuzzer crashes immediately on AMD system
@ 2016-08-18 14:32 Vince Weaver
  2016-08-18 14:46 ` Vince Weaver
  0 siblings, 1 reply; 14+ messages in thread
From: Vince Weaver @ 2016-08-18 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Borislav Petkov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo


Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
falls over more or less immediately.

This maps to variable_test_bit()
	called by ctx = find_get_context(pmu, task, event);
		in kernel/events/core.c:9467

It happens quickly enough I can probably track down the exact event that 
causes this, if needed.

[  101.970659] BUG: unable to handle kernel paging request at ffffffff8653d8a0
[  101.977676] IP: [<ffffffff810e4cb1>] find_get_context.isra.75+0x28/0x20f
[  101.984405] PGD 2807067 PUD 2808063 PMD 0 
[  101.988563] Oops: 0000 [#1] SMP
[  102.069521] CPU: 0 PID: 2205 Comm: perf_fuzzer Not tainted 4.8.0-rc2+ #27
[  102.076313] Hardware name: Hewlett-Packard HP Compaq Pro 6305 SFF/1850, BIOS K06 v02.57 08/16/2013
[  102.085268] task: ffff880223ae5000 task.stack: ffff880224ea8000
[  102.091188] RIP: 0010:[<ffffffff810e4cb1>]  [<ffffffff810e4cb1>] find_get_context.isra.75+0x28/0x20f
[  102.100339] RSP: 0018:ffff880224eabe20  EFLAGS: 00010246
[  102.105657] RAX: 000000002633e300 RBX: 0000000000000000 RCX: 000000002633e300
[  102.112795] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8180ea00
[  102.119929] RBP: ffffffff8180ea00 R08: 0000000000000004 R09: 0000000000000000
[  102.127063] R10: 0000000000000003 R11: 0000000000000246 R12: 000000002633e300
[  102.134196] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff8180ea00
[  102.141327] FS:  00007f743b391700(0000) GS:ffff88022ec00000(0000) knlGS:0000000000000000
[  102.149416] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  102.155167] CR2: ffffffff8653d8a0 CR3: 00000002255b9000 CR4: 00000000000407f0
[  102.162309] Stack:
[  102.164323]  0000000000000000 00000000ffffffff ffff880223b9d800 ffff880224fdd000
[  102.171804]  ffff880223b9d800 0000000000000000 0000000000000000 0000000000000000
[  102.179284]  ffffffff8180ea00 ffffffff810e72be ffffffff00000002 ffff88022e0006c0
[  102.186765] Call Trace:
[  102.189216]  [<ffffffff810e72be>] ? SYSC_perf_event_open+0x525/0xa34
[  102.195579]  [<ffffffff8145251f>] ? entry_SYSCALL_64_fastpath+0x17/0x93
[  102.202203] Code: 41 5c c3 41 57 41 56 41 55 41 54 55 53 48 89 fd 48 89 f3 48 83 ec 18 48 85 f6 75 6c 83 3d 2f 2a 7f 00 00 41 89 cc 7f 1e 44 89 e0 <48> 0f a3 05 87 0f 7f 00 0f 92 c0 84 c0 75 26 48 c7 c0 ed ff ff 
[  102.222256] RIP  [<ffffffff810e4cb1>] find_get_context.isra.75+0x28/0x20f
[  102.229065]  RSP <ffff880224eabe20>
[  102.232556] CR2: ffffffff8653d8a0
[  102.235879] ---[ end trace fa649074c022bab1 ]---


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-18 14:32 perf: fuzzer crashes immediately on AMD system Vince Weaver
@ 2016-08-18 14:46 ` Vince Weaver
  2016-08-19 10:01   ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Vince Weaver @ 2016-08-18 14:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Borislav Petkov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo

On Thu, 18 Aug 2016, Vince Weaver wrote:

> Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> falls over more or less immediately.
> 
> This maps to variable_test_bit()
> 	called by ctx = find_get_context(pmu, task, event);
> 		in kernel/events/core.c:9467
> 
> It happens quickly enough I can probably track down the exact event that 
> causes this, if needed.

I have a one line reproducer:

	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-18 14:46 ` Vince Weaver
@ 2016-08-19 10:01   ` Peter Zijlstra
  2016-08-19 10:56     ` Peter Zijlstra
                       ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Peter Zijlstra @ 2016-08-19 10:01 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel, Borislav Petkov, Ingo Molnar,
	Arnaldo Carvalho de Melo, Huang Rui

On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> On Thu, 18 Aug 2016, Vince Weaver wrote:
> 
> > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > falls over more or less immediately.
> > 
> > This maps to variable_test_bit()
> > 	called by ctx = find_get_context(pmu, task, event);
> > 		in kernel/events/core.c:9467
> > 
> > It happens quickly enough I can probably track down the exact event that 
> > causes this, if needed.
> 
> I have a one line reproducer:
> 
> 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls

OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
various manuals to see if I can spot the fail.

Huang could you either prod someone at AMD or do yourself, audit the AMD
perf code for all the various new models?


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-19 10:01   ` Peter Zijlstra
@ 2016-08-19 10:56     ` Peter Zijlstra
  2016-08-19 15:03     ` Vince Weaver
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2016-08-19 10:56 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel, Borislav Petkov, Ingo Molnar,
	Arnaldo Carvalho de Melo, Huang Rui, Jacob Shin

On Fri, Aug 19, 2016 at 12:01:30PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > On Thu, 18 Aug 2016, Vince Weaver wrote:
> > 
> > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > falls over more or less immediately.
> > > 
> > > This maps to variable_test_bit()
> > > 	called by ctx = find_get_context(pmu, task, event);
> > > 		in kernel/events/core.c:9467
> > > 
> > > It happens quickly enough I can probably track down the exact event that 
> > > causes this, if needed.
> > 
> > I have a one line reproducer:
> > 
> > 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> 
> OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> various manuals to see if I can spot the fail.
> 
> Huang could you either prod someone at AMD or do yourself, audit the AMD
> perf code for all the various new models?

So this should obviously help a little in that it will limit the events
you can program into the hardware.

Not at all sure that is what you're hitting though, because I cannot for
the life of me figure how that would end up exploding in generic code.

---
 arch/x86/events/amd/uncore.c | 47 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index e6131d4..8c314d7 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -174,8 +174,8 @@ static void amd_uncore_del(struct perf_event *event, int flags)
 
 static int amd_uncore_event_init(struct perf_event *event)
 {
-	struct amd_uncore *uncore;
 	struct hw_perf_event *hwc = &event->hw;
+	struct amd_uncore *uncore;
 
 	if (event->attr.type != event->pmu->type)
 		return -ENOENT;
@@ -215,6 +215,47 @@ static int amd_uncore_event_init(struct perf_event *event)
 	return 0;
 }
 
+static inline unsigned int amd_get_event_code(struct hw_perf_event *hwc)
+{
+	return ((hwc->config >> 24) & 0x0f00) | (hwc->config & 0x00ff);
+}
+
+static int amd_uncore_l2_event_init(struct perf_event *event)
+{
+	int ret = amd_uncore_event_init(event);
+	unsigned int event_code;
+
+	if (ret)
+		return ret;
+
+	/*
+	 * Fam16h L2I performance counter events are in the range: 0x060 - 0x07F
+	 */
+	event_code = amd_get_event_code(&event->hw);
+	if (event_code < 0x060 || event_code > 0x07F)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int amd_uncore_nb_event_init(struct perf_event *event)
+{
+	int ret = amd_uncore_event_init(event);
+	unsigned int event_code;
+
+	if (ret)
+		return ret;
+
+	/*
+	 * AMD NB events will have bits 0x0E0 set.
+	 */
+	event_code = amd_get_event_code(&event->hw);
+	if ((event_code & 0x0E0) != 0x0E0)
+		return -EINVAL;
+
+	return 0;
+}
+
 static ssize_t amd_uncore_attr_show_cpumask(struct device *dev,
 					    struct device_attribute *attr,
 					    char *buf)
@@ -266,7 +307,7 @@ static struct pmu amd_nb_pmu = {
 	.task_ctx_nr	= perf_invalid_context,
 	.attr_groups	= amd_uncore_attr_groups,
 	.name		= "amd_nb",
-	.event_init	= amd_uncore_event_init,
+	.event_init	= amd_uncore_nb_event_init,
 	.add		= amd_uncore_add,
 	.del		= amd_uncore_del,
 	.start		= amd_uncore_start,
@@ -278,7 +319,7 @@ static struct pmu amd_l2_pmu = {
 	.task_ctx_nr	= perf_invalid_context,
 	.attr_groups	= amd_uncore_attr_groups,
 	.name		= "amd_l2",
-	.event_init	= amd_uncore_event_init,
+	.event_init	= amd_uncore_l2_event_init,
 	.add		= amd_uncore_add,
 	.del		= amd_uncore_del,
 	.start		= amd_uncore_start,
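
To make the two new checks concrete, here is a small userspace sketch of the same bit manipulation. The extraction expression and the two range tests are copied from the patch above; the function names are made up for this example, and the sketch says nothing about whether the checks address the crash itself. It only shows which config values they would let through: the reproducer's config=0x37 yields event code 0x037, and 0x037 & 0x0E0 is 0x020, so the amd_nb check would return -EINVAL for it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same expression as amd_get_event_code() in the patch: config bits 35:32
 * become event-code bits 11:8, config bits 7:0 stay where they are.
 * (Hypothetical userspace name, not a kernel symbol.)
 */
static unsigned int event_code_from_config(uint64_t config)
{
	return ((config >> 24) & 0x0f00) | (config & 0x00ff);
}

/* Mirrors the amd_nb test: 1 if accepted, 0 where the patch returns -EINVAL. */
static int nb_event_code_allowed(uint64_t config)
{
	return (event_code_from_config(config) & 0x0E0) == 0x0E0;
}

/* Mirrors the amd_l2 test: only event codes 0x060 - 0x07F are accepted. */
static int l2_event_code_allowed(uint64_t config)
{
	unsigned int code = event_code_from_config(config);

	return code >= 0x060 && code <= 0x07F;
}

/* Worked values; call this from a main() or a test harness. */
static void event_code_examples(void)
{
	/* The reproducer: event code 0x037, rejected by the amd_nb check. */
	assert(event_code_from_config(0x37) == 0x037);
	assert(!nb_event_code_allowed(0x37));

	/* Bit 32 of config lands in bit 8 of the event code. */
	assert(event_code_from_config(0x1000000E0ULL) == 0x1E0);
	assert(nb_event_code_allowed(0x1000000E0ULL));

	/* An L2I event inside the documented range. */
	assert(l2_event_code_allowed(0x65));
}
```

Note that config1 plays no part in either check; only the event select bits of config are examined.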


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-19 10:01   ` Peter Zijlstra
  2016-08-19 10:56     ` Peter Zijlstra
@ 2016-08-19 15:03     ` Vince Weaver
  2016-08-19 16:38       ` Vince Weaver
  2016-08-20  4:44     ` Vince Weaver
  2016-08-22 11:16     ` Huang Rui
  3 siblings, 1 reply; 14+ messages in thread
From: Vince Weaver @ 2016-08-19 15:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, linux-kernel, Borislav Petkov, Ingo Molnar,
	Arnaldo Carvalho de Melo, Huang Rui

On Fri, 19 Aug 2016, Peter Zijlstra wrote:

> On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > On Thu, 18 Aug 2016, Vince Weaver wrote:
> > 
> > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > falls over more or less immediately.
> > > 
> > > This maps to variable_test_bit()
> > > 	called by ctx = find_get_context(pmu, task, event);
> > > 		in kernel/events/core.c:9467
> > > 
> > > It happens quickly enough I can probably track down the exact event that 
> > > causes this, if needed.
> > 
> > I have a one line reproducer:
> > 
> > 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> 
> OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> various manuals to see if I can spot the fail.
> 
> Huang could you either prod someone at AMD or do yourself, audit the AMD
> perf code for all the various new models?


OK, this is weird.  I rebooted (didn't patch the kernel, just rebooted) 
and I can't reproduce the original problem at all.

It was perfectly repeatable before I rebooted, dumped an OOPS message 
every time.

Sadly I don't have the fuzzer logs that originally triggered the bug (need 
more serial/USB cables.  Actually no, I need more null-modem adapters).

Let me look into this a bit more.

Vince


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-19 15:03     ` Vince Weaver
@ 2016-08-19 16:38       ` Vince Weaver
  0 siblings, 0 replies; 14+ messages in thread
From: Vince Weaver @ 2016-08-19 16:38 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Peter Zijlstra, linux-kernel, Borislav Petkov, Ingo Molnar,
	Arnaldo Carvalho de Melo, Huang Rui

On Fri, 19 Aug 2016, Vince Weaver wrote:

> OK, this is weird.  I rebooted (didn't patch the kernel, just rebooted) 
> and I can't reproduce the original problem at all.

I rebooted three more times (after perf_fuzzer turned up a more boring 
probably known dump, shown at end) and now I am hitting the original bug 
again.   Weird.  Let me see if I can figure out what is going on.



And for the record, here is the bug the fuzzer kicks out when it doesn't hit the 
weird one:

note this is sprinkled among thousands of
[ 3782.364287] BAD LUCK: lost 7650 message(s) from NMI context!


[ 3780.821837] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [perf_fuzzer:12074]
[ 3781.493831] CPU: 2 PID: 12074 Comm: perf_fuzzer Tainted: G             L  4.8.0-rc2+ #27
[ 3781.508478] Hardware name: Hewlett-Packard HP Compaq Pro 6305 SFF/1850, BIOS K06 v02.57 08/16/2013
[ 3781.524054] task: ffff8802232cf280 task.stack: ffff8802252c0000
[ 3781.542904] RIP: 0010:[<ffffffff810a1020>]  [<ffffffff810a1020>] smp_call_function_single+0xbb/0xca
[ 3781.558618] RSP: 0018:ffff8802252c3d78  EFLAGS: 00000202
[ 3781.570752] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 3781.584757] RDX: 0000000000000001 RSI: 00000000000008fb RDI: 0000000000000300
[ 3781.598819] RBP: 0000000000000001 R08: 0000000000000003 R09: 00007f0c0ea07700
[ 3781.612930] R10: 00007f0c0ea079d0 R11: 0000000000000206 R12: ffffffff810e226b
[ 3781.627107] R13: ffff8802252c3dc8 R14: ffff8802252c3d78 R15: 0000000000000000
[ 3781.641335] FS:  00007f0c0ea07700(0000) GS:ffff88022ed00000(0000) knlGS:0000000000000000
[ 3781.656573] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3781.669534] CR2: 00007f0c0e7d72c8 CR3: 00000002251d1000 CR4: 00000000000407e0
[ 3781.683929] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3781.698410] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000010602
[ 3781.712845] Stack:
[ 3781.747577]  0000000000000000 ffffffff810e226b ffff8802252c3dc8 0000000000000003
[ 3781.787434]  ffffe8ffffc87190 ffff880223fb7800 ffffffff810e5676 0000000000000000
[ 3781.827415]  ffffffff810e18df ffffffff810e16cd 0000000000000000 ffffffff810e13d2
[ 3781.841792] Call Trace:
[ 3781.851292]  [<ffffffff810e226b>] ? perf_cgroup_attach+0x34/0x34
[ 3781.864355]  [<ffffffff810e5676>] ? group_sched_out+0x70/0x70
[ 3781.877219]  [<ffffffff810e18df>] ? event_function_call+0xa8/0xa8
[ 3781.890345]  [<ffffffff810e16cd>] ? cpu_function_call+0x32/0x3b
[ 3781.903284]  [<ffffffff810e13d2>] ? perf_ctx_lock+0x1e/0x1e
[ 3781.915864]  [<ffffffff810e1880>] ? event_function_call+0x49/0xa8
[ 3781.928952]  [<ffffffff810e5676>] ? group_sched_out+0x70/0x70
[ 3781.941675]  [<ffffffff810e18df>] ? event_function_call+0xa8/0xa8
[ 3781.954734]  [<ffffffff810e15a0>] ? perf_event_for_each_child+0x53/0x8a
[ 3781.968295]  [<ffffffff810e7bea>] ? perf_ioctl+0x41d/0x495
[ 3781.980725]  [<ffffffff811515f5>] ? vfs_ioctl+0x16/0x23
[ 3781.992893]  [<ffffffff81151ae3>] ? do_vfs_ioctl+0x46e/0x519
[ 3782.005532]  [<ffffffff81052aad>] ? do_sigaltstack+0xe1/0x1b0
[ 3782.018184]  [<ffffffff81151bdc>] ? SyS_ioctl+0x4e/0x71
[ 3782.030319]  [<ffffffff8145251f>] ? entry_SYSCALL_64_fastpath+0x17/0x93
[ 3782.433996] Code: e2 01 74 04 f3 90 eb f4 83 48 18 01 4c 89 e9 4c 89 e2 4c 89 f6 89 ef e8 94 fe ff ff 85 db 74 0d 41 8b 56 18 80 e2 01 74 04 f3 90 <eb> f3 48 83 c4 20 5b 5d 41 5c 41 5d 41 5e c3 41 56 41 55 41 89 


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-19 10:01   ` Peter Zijlstra
  2016-08-19 10:56     ` Peter Zijlstra
  2016-08-19 15:03     ` Vince Weaver
@ 2016-08-20  4:44     ` Vince Weaver
  2016-08-22 11:16     ` Huang Rui
  3 siblings, 0 replies; 14+ messages in thread
From: Vince Weaver @ 2016-08-20  4:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, linux-kernel, Borislav Petkov, Ingo Molnar,
	Arnaldo Carvalho de Melo, Huang Rui

On Fri, 19 Aug 2016, Peter Zijlstra wrote:

> On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > On Thu, 18 Aug 2016, Vince Weaver wrote:
> > 
> > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > falls over more or less immediately.
> > > 
> > > This maps to variable_test_bit()
> > > 	called by ctx = find_get_context(pmu, task, event);
> > > 		in kernel/events/core.c:9467
> > > 
> > > It happens quickly enough I can probably track down the exact event that 
> > > causes this, if needed.
> > 
> > I have a one line reproducer:
> > 
> > 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> 
> OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> various manuals to see if I can spot the fail.
> 
> Huang could you either prod someone at AMD or do yourself, audit the AMD
> perf code for all the various new models?

This is bizarre, I can't make any sense of the crash.

To recap, the crash looks like this:
	BUG: unable to handle kernel paging request at ffffffff85e67600
	IP: [<ffffffff810e4cb1>] find_get_context.isra.75+0x28/0x20f

The code in question is this code:

	if (!cpu_online(cpu))

	which maps to 
	test_bit(cpumask_check(cpu), cpumask_bits((cpumask)));

	which assembles to

	ffffffff810e4ca9:       41 89 cc                mov    %ecx,%r12d
	ffffffff810e4cac:       7f 1e                   jg     ffffffff810e4ccc <find_get_context.isra.75+0x43>
	ffffffff810e4cae:       44 89 e0                mov    %r12d,%eax
*	ffffffff810e4cb1:       48 0f a3 05 87 0f 7f    bt     %rax,0x7f0f87(%rip)        # ffffffff818d5c40 <__cpu_online_mask>
	ffffffff810e4cb8:       00 
	ffffffff810e4cb9:       0f 92 c0                setb   %al
	ffffffff810e4cbc:       84 c0                   test   %al,%al

There is no way that 0x7f0f87(%rip) should ever possibly be the 
ffffffff85e67600 value that causes the fault.

Though oddly rax when the call happens (according to the oops message)
is RAX: 0000000022c8ce30 which seems nonsensical for a CPU number, but
shouldn't cause an invalid memory address.  Also oddly RDI matches
RAX but RCX doesn't which I think should be true with that assembly.

So very weird.  I even wrote a kernel module and dumped the raw kernel
memory to make sure the instruction stream didn't get overwritten somehow,
but as far as I can tell the code in memory matches the disassembly.

Anyway, I am out of time to look at this for now.

Vince


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-19 10:01   ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  2016-08-20  4:44     ` Vince Weaver
@ 2016-08-22 11:16     ` Huang Rui
  2016-08-23  1:02       ` Vince Weaver
  3 siblings, 1 reply; 14+ messages in thread
From: Huang Rui @ 2016-08-22 11:16 UTC (permalink / raw)
  To: Peter Zijlstra, Vince Weaver
  Cc: linux-kernel, Borislav Petkov, Ingo Molnar, Arnaldo Carvalho de Melo

Hi Peter, Vince

On Fri, Aug 19, 2016 at 12:01:30PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > On Thu, 18 Aug 2016, Vince Weaver wrote:
> > 
> > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > falls over more or less immediately.
> > > 
> > > This maps to variable_test_bit()
> > > 	called by ctx = find_get_context(pmu, task, event);
> > > 		in kernel/events/core.c:9467
> > > 
> > > It happens quickly enough I can probably track down the exact event that 
> > > causes this, if needed.
> > 
> > I have a one line reproducer:
> > 
> > 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> 
> OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> various manuals to see if I can spot the fail.
> 
> Huang could you either prod someone at AMD or do yourself, audit the AMD
> perf code for all the various new models?

Actually, there might be some NBPMC event changes between model 0h-fh and
model 10h-1fh. Below are the documents of these two processors:

http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
http://support.amd.com/TechDocs/42300_15h_Mod_10h-1Fh_BKDG.pdf

In section 3.16, it describes usage of NB Performance Counter Events.

Hope it helps. :-)

Thanks,
Rui


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-22 11:16     ` Huang Rui
@ 2016-08-23  1:02       ` Vince Weaver
  2016-08-23  2:54         ` Vince Weaver
  0 siblings, 1 reply; 14+ messages in thread
From: Vince Weaver @ 2016-08-23  1:02 UTC (permalink / raw)
  To: Huang Rui
  Cc: Peter Zijlstra, Vince Weaver, linux-kernel, Borislav Petkov,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, 22 Aug 2016, Huang Rui wrote:

> Hi Peter, Vince
> 
> On Fri, Aug 19, 2016 at 12:01:30PM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > > On Thu, 18 Aug 2016, Vince Weaver wrote:
> > > 
> > > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > > falls over more or less immediately.
> > > > 
> > > > This maps to variable_test_bit()
> > > > 	called by ctx = find_get_context(pmu, task, event);
> > > > 		in kernel/events/core.c:9467
> > > > 
> > > > It happens quickly enough I can probably track down the exact event that 
> > > > causes this, if needed.
> > > 
> > > I have a one line reproducer:
> > > 
> > > 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> > 
> > OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> > various manuals to see if I can spot the fail.
> > 
> > Huang could you either prod someone at AMD or do yourself, audit the AMD
> > perf code for all the various new models?
> 
> Actually, there might be some NBPMC event changes between model 0h-fh and
> model 10h-1fh. Below are the documents of these two processors:
> 
> http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
> http://support.amd.com/TechDocs/42300_15h_Mod_10h-1Fh_BKDG.pdf
> 
> In section 3.16, it describes usage of NB Performance Counter Events.

I don't think it's the hardware that's causing the problem.

I've wasted a lot more time on it, and finally figured out how the "bt" 
instruction works, so the assembly more or less makes sense.
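
For anyone else puzzling over the same disassembly, a sketch of the address arithmetic follows. It assumes the documented behaviour of "bt" with a register bit offset and a 64-bit memory operand: the offset is signed and is not masked to the operand size, so the CPU reads the quadword at mem + 8 * floor(offset / 64). The function names are invented for this example; the mask address and register values are the ones from the disassembly and the two oops reports earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Address touched by "bt %reg, mem" with a 64-bit operand size.
 * (Hypothetical helper for illustration, not kernel code.)
 */
static uint64_t bt_effective_address(uint64_t mem, int64_t bit_offset)
{
	int64_t qword = bit_offset / 64;

	/* C division truncates toward zero; adjust to floor for negatives. */
	if (bit_offset % 64 < 0)
		qword--;

	/* Unsigned arithmetic so that wrap-around is well defined. */
	return mem + (uint64_t)qword * 8;
}

/* Worked values; call this from a main() or a test harness. */
static void bt_examples(void)
{
	/* Address of __cpu_online_mask according to the disassembly. */
	const uint64_t cpu_online_mask = 0xffffffff818d5c40ULL;

	/* A sane CPU number stays inside the first word of the mask. */
	assert(bt_effective_address(cpu_online_mask, 3) == cpu_online_mask);

	/* RAX from the first oops gives exactly the CR2 it reported. */
	assert(bt_effective_address(cpu_online_mask, 0x2633e300) ==
	       0xffffffff8653d8a0ULL);

	/* RAX from the second oops gives its faulting address as well. */
	assert(bt_effective_address(cpu_online_mask, 0x22c8ce30) ==
	       0xffffffff85e67600ULL);
}
```

So the faulting address is simply the online mask plus the bogus "CPU number" divided by eight (rounded down to a quadword), which is why a large value in RAX is enough to walk off the end of mapped memory.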

The problem is the per-cpu amd_uncore struct is being over-written with 
kernel memory addresses.

This makes uncore[0]->cpu a large number (it's often, but not always, the 
per-cpu address of uncore[1]->cpu) which leads to the page fault.

I can't figure out what piece of code is overwriting things though.

And to make things complicated, I think the 
	amd_uncore_find_online_sibling()
function is broken.  The code could really use more commenting, but I 
think it is designed so all siblings share one single amd_uncore 
structure, but in practice it looks like this doesn't work due to the way 
the list iterator works.

Vince


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-23  1:02       ` Vince Weaver
@ 2016-08-23  2:54         ` Vince Weaver
  2016-08-23  8:45           ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Vince Weaver @ 2016-08-23  2:54 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Huang Rui, Peter Zijlstra, linux-kernel, Borislav Petkov,
	Ingo Molnar, Arnaldo Carvalho de Melo

> > > > 
> > > > 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> 	amd_uncore_find_online_sibling()
> function is broken. 

and that's the problem.  uncore_find_online_sibling() does all kinds of 
wrong things including sticking active uncore structures in 
uncore->free_when_cpu_online

Then uncore_online() comes along and frees those structures.

Then some other part of the kernel comes and re-uses the free'd data.

Then when we try to start an event, all of the fields are invalid because 
the uncore pointer is pointing to re-used data.
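
The four steps above can be replayed in a toy userspace model. This is not the driver code: every name below is hypothetical, the "allocator" is a single static slot so that the stale read stays well defined in C, and the values written on reuse are arbitrary kernel-looking addresses chosen for illustration. It only shows why the cpu field stops looking like a CPU number once the slot has a new owner.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the per-NB bookkeeping structure. */
struct toy_uncore {
	uint64_t cpu;		/* CPU that owns the counters */
	uint64_t refcnt;
};

/* One slot of a toy allocator: the same bytes serve successive owners. */
static union {
	struct toy_uncore uncore;
	uint64_t words[2];
} toy_slot;
static int toy_slot_free = 1;

static struct toy_uncore *toy_alloc_uncore(uint64_t cpu)
{
	assert(toy_slot_free);
	toy_slot_free = 0;
	toy_slot.uncore.cpu = cpu;
	toy_slot.uncore.refcnt = 1;
	return &toy_slot.uncore;
}

/* Stand-in for kfree(): the slot may be handed out again. */
static void toy_free(struct toy_uncore *u)
{
	assert(u == &toy_slot.uncore);
	toy_slot_free = 1;
}

/* An unrelated user takes the slot and stores pointer-like values in it. */
static void toy_reuse_slot(void)
{
	assert(toy_slot_free);
	toy_slot_free = 0;
	toy_slot.words[0] = 0xffff880223b9d800ULL;	/* arbitrary example */
	toy_slot.words[1] = 0xffff880224fdd000ULL;	/* arbitrary example */
}

/* Replays the sequence; returns the "cpu" seen through the stale pointer. */
static uint64_t toy_replay(void)
{
	struct toy_uncore *percpu_ptr, *free_when_cpu_online;

	toy_slot_free = 1;				/* start from a clean slot */
	percpu_ptr = toy_alloc_uncore(0);		/* struct in active use */
	free_when_cpu_online = percpu_ptr;		/* step 1: parked for freeing */
	toy_free(free_when_cpu_online);			/* step 2: online path frees it */
	toy_reuse_slot();				/* step 3: memory gets a new owner */
	return percpu_ptr->cpu;				/* step 4: no longer a CPU number */
}
```

In this model toy_replay() returns 0xffff880223b9d800 rather than 0, which is the kind of value that then gets used as a bit offset into the online mask.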

I don't have a patch because I am not 100% clear on what 
uncore_find_online_sibling() is doing in the first place.

Vince


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-23  2:54         ` Vince Weaver
@ 2016-08-23  8:45           ` Peter Zijlstra
  2016-08-23 11:53             ` Vince Weaver
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2016-08-23  8:45 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Huang Rui, linux-kernel, Borislav Petkov, Ingo Molnar,
	Arnaldo Carvalho de Melo

On Mon, Aug 22, 2016 at 10:54:32PM -0400, Vince Weaver wrote:
> > > > > 
> > > > > 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> > 	amd_uncore_find_online_sibling()
> > function is broken. 
> 
> and that's the problem.  uncore_find_online_sibling() does all kinds of 
> wrong things including sticking active uncore structures in 
> uncore->free_when_cpu_online
> 
> Then uncore_online() comes along and frees those structures.
> 
> Then some other part of the kernel comes and re-uses the free'd data.
> 
> Then when we try to start an event, all of the fields are invalid because 
> the uncore pointer is pointing to re-used data.
> 
> I don't have a patch because I am not 100% clear on what 
> uncore_find_online_sibling() is doing in the first place.

Thanks for doing all that, I'll see if I can make sense of it.


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-23  8:45           ` Peter Zijlstra
@ 2016-08-23 11:53             ` Vince Weaver
  2016-08-24  9:19               ` Ingo Molnar
  0 siblings, 1 reply; 14+ messages in thread
From: Vince Weaver @ 2016-08-23 11:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, Huang Rui, linux-kernel, Borislav Petkov,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Tue, 23 Aug 2016, Peter Zijlstra wrote:

> On Mon, Aug 22, 2016 at 10:54:32PM -0400, Vince Weaver wrote:
> > > > > > 
> > > > > > 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> > > 	amd_uncore_find_online_sibling()
> > > function is broken. 
> > 
> > and that's the problem.  uncore_find_online_sibling() does all kinds of 
> > wrong things including sticking active uncore structures in 
> > uncore->free_when_cpu_online
> > 
> > Then uncore_online() comes along and frees those structures.
> > 
> > Then some other part of the kernel comes and re-uses the free'd data.
> > 
> > Then when we try to start an event, all of the fields are invalid because 
> > the uncore pointer is pointing to re-used data.
> > 
> > I don't have a patch because I am not 100% clear on what 
> > uncore_find_online_sibling() is doing in the first place.
> 
> Thanks for doing all that, I'll see if I can make sense of it.

I should have provided more detail, was just tired after chasing the bug 
for so long.  I mostly found things by sprinkling printks everywhere.
Commenting out the call to kfree() in uncore_online() makes the code stop 
crashing (but perhaps causes a memory leak?)

In any case it's odd the problem didn't show up earlier, but maybe the 
recent changes to CPU hotplugging in that file exposed the issue.

Vince


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-23 11:53             ` Vince Weaver
@ 2016-08-24  9:19               ` Ingo Molnar
  2016-08-24 13:20                 ` Vince Weaver
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2016-08-24  9:19 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Peter Zijlstra, Huang Rui, linux-kernel, Borislav Petkov,
	Ingo Molnar, Arnaldo Carvalho de Melo


* Vince Weaver <vincent.weaver@maine.edu> wrote:

> On Tue, 23 Aug 2016, Peter Zijlstra wrote:
> 
> > On Mon, Aug 22, 2016 at 10:54:32PM -0400, Vince Weaver wrote:
> > > > > > > 
> > > > > > > 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> > > > 	amd_uncore_find_online_sibling()
> > > > function is broken. 
> > > 
> > > and that's the problem.  uncore_find_online_sibling() does all kinds of 
> > > wrong things including sticking active uncore structures in 
> > > uncore->free_when_cpu_online
> > > 
> > > Then uncore_online() comes along and frees those structures.
> > > 
> > > Then some other part of the kernel comes and re-uses the free'd data.
> > > 
> > > Then when we try to start an event, all of the fields are invalid because 
> > > the uncore pointer is pointing to re-used data.
> > > 
> > > I don't have a patch because I am not 100% clear on what 
> > > uncore_find_online_sibling() is doing in the first place.
> > 
> > Thanks for doing all that, I'll see if I can make sense of it.
> 
> I should have provided more detail, was just tired after chasing the bug 
> for so long.  I mostly found things by sprinkling printks everywhere.
> Commenting out the call to kfree() in uncore_online() makes the code stop 
> crashing (but perhaps causes a memory leak?)

If there's no progress finding the root cause I'd be happy to exchange a crash for 
a leak ...

> In any case it's odd the problem didn't show up earlier, but maybe the 
> recent changes to CPU hotplugging in that file exposed the issue.

Yeah, we had lots of changes to CPU hotplugging recently.

Thanks,

	Ingo


* Re: perf: fuzzer crashes immediately on AMD system
  2016-08-24  9:19               ` Ingo Molnar
@ 2016-08-24 13:20                 ` Vince Weaver
  0 siblings, 0 replies; 14+ messages in thread
From: Vince Weaver @ 2016-08-24 13:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vince Weaver, Peter Zijlstra, Huang Rui, linux-kernel,
	Borislav Petkov, Ingo Molnar, Arnaldo Carvalho de Melo

On Wed, 24 Aug 2016, Ingo Molnar wrote:
> If there's no progress finding the root cause I'd be happy to exchange a crash for 
> a leak ...

It's actually a crash of the program doing the perf_event_open() call, not 
a crash of the system (at least in my experience).

However, it's possible that if you have bad luck and if the kfree'd space 
is reused with just the right combination of values you could potentially 
end up crashing the system.

Vince

