Re: hit a KASan bug related to Perf during stress test

From: Peter Zijlstra @ 2016-10-24  9:53 UTC
  To: Ni, BaoleX
  Cc: mingo, acme, linux-kernel, alexander.shishkin, Liu, Chuansheng,
	Oleg Nesterov

On Mon, Oct 24, 2016 at 09:35:46AM +0000, Ni, BaoleX wrote:
> 
> [32736.018823] BUG: KASan: use after free in task_tgid_nr_ns+0x35/0xb0 at addr ffff8800265568c0
> [32736.028309] Read of size 8 by task dumpsys/11268
> [32736.033511] =============================================================================
> [32736.042700] BUG task_struct (Tainted: G        W  O): kasan: bad access detected

The 'W' taint flag means this wasn't the first WARN you got, so this
might be the result of prior borkage.
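
For readers unfamiliar with the "Tainted:" field, here is a minimal
userspace sketch of how those letters can be read. It is only an
illustration: the helper names taint_meaning() and taint_has() are
hypothetical, nothing here is kernel code, and the table covers only a
handful of flags rather than the full set.

```c
#include <stddef.h>
#include <string.h>

/*
 * Illustrative lookup table (not kernel code): a few of the letters
 * that can appear in the "Tainted:" field of an oops header.
 */
struct taint_desc {
	char letter;
	const char *meaning;
};

static const struct taint_desc taint_table[] = {
	{ 'G', "no proprietary module loaded" },
	{ 'P', "proprietary module loaded" },
	{ 'B', "bad page reference or unexpected page flags" },
	{ 'W', "a kernel warning (WARN) was issued earlier" },
	{ 'O', "out-of-tree module loaded" },
	{ 'D', "kernel has oopsed before" },
};

/* Return a description for one taint letter, or NULL if not in the table. */
const char *taint_meaning(char letter)
{
	size_t i;

	for (i = 0; i < sizeof(taint_table) / sizeof(taint_table[0]); i++)
		if (taint_table[i].letter == letter)
			return taint_table[i].meaning;
	return NULL;
}

/* Return 1 if the taint string (e.g. "G        W  O") contains the letter. */
int taint_has(const char *taint, char letter)
{
	return letter != '\0' && strchr(taint, letter) != NULL;
}
```

With the string from the report above, taint_has("G        W  O", 'W')
is true, which is the point being made: a WARN had already fired before
this KASan splat.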

Also, it says "BUG task_struct"; does that mean task_struct was the
object accessed after free?

> [32736.051002] -----------------------------------------------------------------------------
> [32736.051002] 
> [32736.061840] Disabling lock debugging due to kernel taint
> [32736.067830] INFO: Slab 0xffffea0000995400 objects=5 used=3 fp=0xffff880026550000 flags=0x4000000000004080
> [32736.078572] INFO: Object 0xffff880026556440 @offset=25664 fp=0x          (null)
> ...
> [32738.776936] CPU: 0 PID: 11268 Comm: dumpsys Tainted: G    B   W  O 3.14.70-x86_64-02260-g162539f #1
> [32738.787092] Hardware name: Insyde CherryTrail/T3 MRD, BIOS CHTMRD.A6.002.016 09/20/2016
> [32738.796082]  ffff880026550000 0000000000000086 0000000000000000 ffff880065e05a70
> [32738.796215]  ffffffff81fc9427 ffff880065803b40 ffff880026556440 ffff880065e05aa0
> [32738.796345]  ffffffff8123fe2d ffff880065803b40 ffffea0000995400 ffff880026556440
> [32738.796475] Call Trace:
> [32738.796510]  <NMI> 
> [32738.796585]  [<ffffffff81fc9427>] dump_stack+0x67/0x90
> [32738.802404]  [<ffffffff8123fe2d>] print_trailer+0xfd/0x170
> [32738.808603]  [<ffffffff81244f26>] object_err+0x36/0x40
> [32738.814417]  [<ffffffff812467ed>] kasan_report_error+0x1fd/0x3d0
> [32738.821193]  [<ffffffff81131b84>] ? __rcu_read_unlock+0x24/0x90
> [32738.827881]  [<ffffffff81fe0888>] ? preempt_count_sub+0x18/0xf0
> [32738.834565]  [<ffffffff811db32c>] ? perf_output_put_handle+0x5c/0x170
> [32738.841833]  [<ffffffff81246e70>] kasan_report+0x40/0x50
> [32738.847838]  [<ffffffff810d9975>] ? task_tgid_nr_ns+0x35/0xb0
> [32738.854327]  [<ffffffff81245d59>] __asan_load8+0x69/0xa0
> [32738.860333]  [<ffffffff811dba18>] ? perf_output_copy+0x88/0x120
> [32738.867020]  [<ffffffff810d9975>] task_tgid_nr_ns+0x35/0xb0

So here we did: perf_event_[pt]id(event, current);

How can _current_ not be valid anymore?

> [32738.873319]  [<ffffffff811cd5d8>] __perf_event_header__init_id+0xb8/0x200
> [32738.880970]  [<ffffffff811d6f19>] perf_prepare_sample+0xa9/0x4a0
> [32738.887754]  [<ffffffff811d7700>] __perf_event_overflow+0x3f0/0x460
> [32738.894835]  [<ffffffff81022998>] ? x86_perf_event_set_period+0x128/0x210
> [32738.902496]  [<ffffffff811d8494>] perf_event_overflow+0x14/0x20
> [32738.909180]  [<ffffffff8102cabc>] intel_pmu_handle_irq+0x25c/0x520
> [32738.916156]  [<ffffffff81245945>] ? __asan_store8+0x15/0xa0
> [32738.922460]  [<ffffffff81fddb8b>] perf_event_nmi_handler+0x2b/0x50
> [32738.929437]  [<ffffffff81fdd4a8>] nmi_handle+0x88/0x230
> [32738.935346]  [<ffffffff81009873>] do_nmi+0x193/0x490
> [32738.940963]  [<ffffffff81fdc6d6>] end_repeat_nmi+0x1a/0x1e
> [32738.947163]  [<ffffffff81245d22>] ? __asan_load8+0x32/0xa0
> [32738.953358]  [<ffffffff81245d22>] ? __asan_load8+0x32/0xa0
> [32738.959554]  [<ffffffff81245d22>] ? __asan_load8+0x32/0xa0
> [32738.965718]  <<EOE>> 
> [32738.965787]  [<ffffffff811065a2>] ? check_preempt_wakeup+0x1a2/0x3a0
> [32738.972970]  [<ffffffff810f4618>] check_preempt_curr+0xf8/0x120
> [32738.979658]  [<ffffffff810f465d>] ttwu_do_wakeup+0x1d/0x1b0
> [32738.985953]  [<ffffffff810f4909>] ttwu_do_activate.constprop.105+0x89/0x90
> [32738.993710]  [<ffffffff810f87fe>] try_to_wake_up+0x29e/0x4e0
> [32739.000100]  [<ffffffff810f8aaf>] default_wake_function+0x2f/0x40
> [32739.006979]  [<ffffffff81114338>] autoremove_wake_function+0x18/0x50
> [32739.014149]  [<ffffffff81fe0888>] ? preempt_count_sub+0x18/0xf0
> [32739.020836]  [<ffffffff81113ab9>] __wake_up_common+0x79/0xb0
> [32739.027232]  [<ffffffff81113d69>] __wake_up+0x39/0x50
> [32739.032945]  [<ffffffff81135918>] __call_rcu_nocb_enqueue+0x158/0x160
> [32739.040207]  [<ffffffff81135a4c>] __call_rcu+0x12c/0x450

And while we just called release_task(), that call_rcu() should still be
pending at this point. Also, I don't think the task can stop being
current until after do_task_dead(), where we schedule away from the dead
task and change current.

> [32739.046207]  [<ffffffff81135dcd>] call_rcu+0x1d/0x20
> [32739.051821]  [<ffffffff810ae2da>] release_task+0x6aa/0x8d0
> [32739.058022]  [<ffffffff8111e86f>] ? do_raw_write_unlock+0x6f/0xd0
> [32739.064900]  [<ffffffff810b1002>] do_exit+0xe52/0x1020
> [32739.070712]  [<ffffffff810b1222>] SyS_exit+0x22/0x30
> [32739.076328]  [<ffffffff81fe9063>] sysenter_dispatch+0x7/0x1f
> [32739.082725]  [<ffffffff8152f33b>] ? trace_hardirqs_on_thunk+0x3a/0x3c

Oleg, any idea?
