linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: hit a KASan bug related to Perf during stress test
       [not found] <318B87A793BE164187D8851D6CE09D64371C8811@shsmsx102.ccr.corp.intel.com>
@ 2016-10-24  9:53 ` Peter Zijlstra
  2016-10-24 11:15   ` Oleg Nesterov
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-24  9:53 UTC (permalink / raw)
  To: Ni, BaoleX
  Cc: mingo, acme, linux-kernel, alexander.shishkin, Liu, Chuansheng,
	Oleg Nesterov

On Mon, Oct 24, 2016 at 09:35:46AM +0000, Ni, BaoleX wrote:
> 
> [32736.018823] BUG: KASan: use after free in task_tgid_nr_ns+0x35/0xb0 at addr ffff8800265568c0
> [32736.028309] Read of size 8 by task dumpsys/11268
> [32736.033511] =============================================================================
> [32736.042700] BUG task_struct (Tainted: G        W  O): kasan: bad access detected

'W' this wasn't the first WARN you got, this means this might be the
result of prior borkage.

Also, it says: "BUG task_struct", does that mean task_struct was the
object accessed after free?

> [32736.051002] -----------------------------------------------------------------------------
> [32736.051002] 
> [32736.061840] Disabling lock debugging due to kernel taint
> [32736.067830] INFO: Slab 0xffffea0000995400 objects=5 used=3 fp=0xffff880026550000 flags=0x4000000000004080
> [32736.078572] INFO: Object 0xffff880026556440 @offset=25664 fp=0x          (null)
> ...
> [32738.776936] CPU: 0 PID: 11268 Comm: dumpsys Tainted: G    B   W  O 3.14.70-x86_64-02260-g162539f #1
> [32738.787092] Hardware name: Insyde CherryTrail/T3 MRD, BIOS CHTMRD.A6.002.016 09/20/2016
> [32738.796082]  ffff880026550000 0000000000000086 0000000000000000 ffff880065e05a70
> [32738.796215]  ffffffff81fc9427 ffff880065803b40 ffff880026556440 ffff880065e05aa0
> [32738.796345]  ffffffff8123fe2d ffff880065803b40 ffffea0000995400 ffff880026556440
> [32738.796475] Call Trace:
> [32738.796510]  <NMI> 
> [32738.796585]  [<ffffffff81fc9427>] dump_stack+0x67/0x90
> [32738.802404]  [<ffffffff8123fe2d>] print_trailer+0xfd/0x170
> [32738.808603]  [<ffffffff81244f26>] object_err+0x36/0x40
> [32738.814417]  [<ffffffff812467ed>] kasan_report_error+0x1fd/0x3d0
> [32738.821193]  [<ffffffff81131b84>] ? __rcu_read_unlock+0x24/0x90
> [32738.827881]  [<ffffffff81fe0888>] ? preempt_count_sub+0x18/0xf0
> [32738.834565]  [<ffffffff811db32c>] ? perf_output_put_handle+0x5c/0x170
> [32738.841833]  [<ffffffff81246e70>] kasan_report+0x40/0x50
> [32738.847838]  [<ffffffff810d9975>] ? task_tgid_nr_ns+0x35/0xb0
> [32738.854327]  [<ffffffff81245d59>] __asan_load8+0x69/0xa0
> [32738.860333]  [<ffffffff811dba18>] ? perf_output_copy+0x88/0x120
> [32738.867020]  [<ffffffff810d9975>] task_tgid_nr_ns+0x35/0xb0

So here we did: perf_event_[pt]id(event, current);

How can _current_ not be valid anymore?

> [32738.873319]  [<ffffffff811cd5d8>] __perf_event_header__init_id+0xb8/0x200
> [32738.880970]  [<ffffffff811d6f19>] perf_prepare_sample+0xa9/0x4a0
> [32738.887754]  [<ffffffff811d7700>] __perf_event_overflow+0x3f0/0x460
> [32738.894835]  [<ffffffff81022998>] ? x86_perf_event_set_period+0x128/0x210
> [32738.902496]  [<ffffffff811d8494>] perf_event_overflow+0x14/0x20
> [32738.909180]  [<ffffffff8102cabc>] intel_pmu_handle_irq+0x25c/0x520
> [32738.916156]  [<ffffffff81245945>] ? __asan_store8+0x15/0xa0
> [32738.922460]  [<ffffffff81fddb8b>] perf_event_nmi_handler+0x2b/0x50
> [32738.929437]  [<ffffffff81fdd4a8>] nmi_handle+0x88/0x230
> [32738.935346]  [<ffffffff81009873>] do_nmi+0x193/0x490
> [32738.940963]  [<ffffffff81fdc6d6>] end_repeat_nmi+0x1a/0x1e
> [32738.947163]  [<ffffffff81245d22>] ? __asan_load8+0x32/0xa0
> [32738.953358]  [<ffffffff81245d22>] ? __asan_load8+0x32/0xa0
> [32738.959554]  [<ffffffff81245d22>] ? __asan_load8+0x32/0xa0
> [32738.965718]  <<EOE>> 
> [32738.965787]  [<ffffffff811065a2>] ? check_preempt_wakeup+0x1a2/0x3a0
> [32738.972970]  [<ffffffff810f4618>] check_preempt_curr+0xf8/0x120
> [32738.979658]  [<ffffffff810f465d>] ttwu_do_wakeup+0x1d/0x1b0
> [32738.985953]  [<ffffffff810f4909>] ttwu_do_activate.constprop.105+0x89/0x90
> [32738.993710]  [<ffffffff810f87fe>] try_to_wake_up+0x29e/0x4e0
> [32739.000100]  [<ffffffff810f8aaf>] default_wake_function+0x2f/0x40
> [32739.006979]  [<ffffffff81114338>] autoremove_wake_function+0x18/0x50
> [32739.014149]  [<ffffffff81fe0888>] ? preempt_count_sub+0x18/0xf0
> [32739.020836]  [<ffffffff81113ab9>] __wake_up_common+0x79/0xb0
> [32739.027232]  [<ffffffff81113d69>] __wake_up+0x39/0x50
> [32739.032945]  [<ffffffff81135918>] __call_rcu_nocb_enqueue+0x158/0x160
> [32739.040207]  [<ffffffff81135a4c>] __call_rcu+0x12c/0x450

And while we just called release_task(), that call_rcu() should still be
pending at this point, also I don't think that can be current until
after do_task_dead() where we schedule away from the dead task and
change current.

> [32739.046207]  [<ffffffff81135dcd>] call_rcu+0x1d/0x20
> [32739.051821]  [<ffffffff810ae2da>] release_task+0x6aa/0x8d0
> [32739.058022]  [<ffffffff8111e86f>] ? do_raw_write_unlock+0x6f/0xd0
> [32739.064900]  [<ffffffff810b1002>] do_exit+0xe52/0x1020
> [32739.070712]  [<ffffffff810b1222>] SyS_exit+0x22/0x30
> [32739.076328]  [<ffffffff81fe9063>] sysenter_dispatch+0x7/0x1f
> [32739.082725]  [<ffffffff8152f33b>] ? trace_hardirqs_on_thunk+0x3a/0x3c

Oleg, any idea?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24  9:53 ` hit a KASan bug related to Perf during stress test Peter Zijlstra
@ 2016-10-24 11:15   ` Oleg Nesterov
  2016-10-24 11:24     ` Peter Zijlstra
                       ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Oleg Nesterov @ 2016-10-24 11:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On 10/24, Peter Zijlstra wrote:
>
> > [32738.867020]  [<ffffffff810d9975>] task_tgid_nr_ns+0x35/0xb0
>
> So here we did: perf_event_[pt]id(event, current);
>
> How can _current_ not be valid anymore?

...

> > [32739.040207]  [<ffffffff81135a4c>] __call_rcu+0x12c/0x450
>
> And while we just called release_task(), that call_rcu() should still be
> pending at this point,

Yes, current is still valid.

But nothing protects current->group_leader or parent/real_parent, they
can point to the exited/freed task. We really need to nullify them in
__unhash_process() to catch the problems like this, I wanted to do this
many times...

So you simply can't know your tgid or even tid after release_task() calls
__unhash_process(). Actually after exit_notify() unless the exiting task
autoreaps itself.

How about the trivial fix below?

Oleg.

--- x/kernel/events/core.c
+++ x/kernel/events/core.c
@@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
 	if (event->parent)
 		event = event->parent;
 
-	return task_tgid_nr_ns(p, event->ns);
+	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
 }
 
 static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 11:15   ` Oleg Nesterov
@ 2016-10-24 11:24     ` Peter Zijlstra
  2016-10-24 12:02       ` Oleg Nesterov
  2016-10-24 11:27     ` Peter Zijlstra
  2016-10-24 12:11     ` Peter Zijlstra
  2 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-24 11:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> On 10/24, Peter Zijlstra wrote:
> >
> > > [32738.867020]  [<ffffffff810d9975>] task_tgid_nr_ns+0x35/0xb0
> >
> > So here we did: perf_event_[pt]id(event, current);
> >
> > How can _current_ not be valid anymore?
> 
> ...
> 
> > > [32739.040207]  [<ffffffff81135a4c>] __call_rcu+0x12c/0x450
> >
> > And while we just called release_task(), that call_rcu() should still be
> > pending at this point,
> 
> Yes, current is still valid.
> 
> But nothing protects current->group_leader or parent/real_parent, they
> can point to the exited/freed task. We really need to nullify them in
> __unhash_process() to catch the problems like this, I wanted to do this
> many times...
> 
> So you simply can't know your tgid or even tid after release_task() calls
> __unhash_process(). Actually after exit_notify() unless the exiting task
> autoreaps itself.
> 
> How about the trivial fix below?
> 
> Oleg.
> 
> --- x/kernel/events/core.c
> +++ x/kernel/events/core.c
> @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
>  	if (event->parent)
>  		event = event->parent;
>  
> -	return task_tgid_nr_ns(p, event->ns);
> +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
>  }

Hurm.. should we not push this into task_tgid_nr_ns() ? I mean, now the
user needs to be aware of this dinky detail.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 11:15   ` Oleg Nesterov
  2016-10-24 11:24     ` Peter Zijlstra
@ 2016-10-24 11:27     ` Peter Zijlstra
  2016-10-24 11:29       ` Peter Zijlstra
  2016-10-24 12:11     ` Peter Zijlstra
  2 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-24 11:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng, Jiri Olsa

On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> How about the trivial fix below?
> 
> Oleg.
> 
> --- x/kernel/events/core.c
> +++ x/kernel/events/core.c
> @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
>  	if (event->parent)
>  		event = event->parent;
>  
> -	return task_tgid_nr_ns(p, event->ns);
> +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
>  }

Also, now we get a (few) sample(s) with a different pid:tid than prior
samples and not matching the sched_switch() events.

I can imagine that being somewhat confusing for people/tools.

Acme/Jolsa, any idea if that will bugger perf-report?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 11:27     ` Peter Zijlstra
@ 2016-10-24 11:29       ` Peter Zijlstra
  2016-10-24 12:04         ` Jiri Olsa
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-24 11:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng, Jiri Olsa

On Mon, Oct 24, 2016 at 01:27:32PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> > How about the trivial fix below?
> > 
> > Oleg.
> > 
> > --- x/kernel/events/core.c
> > +++ x/kernel/events/core.c
> > @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> >  	if (event->parent)
> >  		event = event->parent;
> >  
> > -	return task_tgid_nr_ns(p, event->ns);
> > +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> >  }
> 
> Also, now we get a (few) sample(s) with a different pid:tid than prior
> samples and not matching the sched_switch() events.
> 
> I can imagine that being somewhat confusing for people/tools.
> 
> Acme/Jolsa, any idea if that will bugger perf-report?

Hurm, then again, I imagine that after unhash_process the PID/TID could
be instantly re-used and then we're still confused.

Yuck.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 11:24     ` Peter Zijlstra
@ 2016-10-24 12:02       ` Oleg Nesterov
  2016-10-24 12:10         ` Oleg Nesterov
  2016-10-24 12:19         ` Peter Zijlstra
  0 siblings, 2 replies; 27+ messages in thread
From: Oleg Nesterov @ 2016-10-24 12:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On 10/24, Peter Zijlstra wrote:
>
> On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> >
> > --- x/kernel/events/core.c
> > +++ x/kernel/events/core.c
> > @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> >  	if (event->parent)
> >  		event = event->parent;
> >
> > -	return task_tgid_nr_ns(p, event->ns);
> > +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> >  }
>
> Hurm.. should we not push this into task_tgid_nr_ns() ? I mean, now the
> user needs to be aware of this dinky detail.

Perhaps. Or into task_tgid(). Or even the patch below, __task_pid_nr_ns()
is always safe. This certainly needs some cleanups.

Oleg.

--- x/include/linux/pid.h
+++ x/include/linux/pid.h
@@ -8,7 +8,8 @@ enum pid_type
 	PIDTYPE_PID,
 	PIDTYPE_PGID,
 	PIDTYPE_SID,
-	PIDTYPE_MAX
+	PIDTYPE_MAX,
+	PIDTYPE_TGID	/* do not use */
 };
 
 /*
--- x/kernel/pid.c
+++ x/kernel/pid.c
@@ -538,7 +538,7 @@ EXPORT_SYMBOL(__task_pid_nr_ns);
 
 pid_t task_tgid_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
 {
-	return pid_nr_ns(task_tgid(tsk), ns);
+	return __task_pid_nr_ns(tsk, PIDTYPE_TGID, ns);
 }
 EXPORT_SYMBOL(task_tgid_nr_ns);
 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 11:29       ` Peter Zijlstra
@ 2016-10-24 12:04         ` Jiri Olsa
  2016-10-24 12:12           ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Jiri Olsa @ 2016-10-24 12:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Ni, BaoleX, mingo, acme, linux-kernel,
	alexander.shishkin, Liu, Chuansheng, Jiri Olsa

On Mon, Oct 24, 2016 at 01:29:45PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 24, 2016 at 01:27:32PM +0200, Peter Zijlstra wrote:
> > On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> > > How about the trivial fix below?
> > > 
> > > Oleg.
> > > 
> > > --- x/kernel/events/core.c
> > > +++ x/kernel/events/core.c
> > > @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> > >  	if (event->parent)
> > >  		event = event->parent;
> > >  
> > > -	return task_tgid_nr_ns(p, event->ns);
> > > +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> > >  }
> > 
> > Also, now we get a (few) sample(s) with a different pid:tid than prior
> > samples and not matching the sched_switch() events.
> > 
> > I can imagine that being somewhat confusing for people/tools.
> > 
> > Acme/Jolsa, any idea if that will bugger perf-report?
> 
> Hurm, then again, I imagine that after unhash_process the PID/TID could
> be instantly re-used and then we're still confused.

sounds bad.. I haven't checked the related pid_alive code,
but shouldn't we already get the EXIT event in this case?

jirka

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 12:02       ` Oleg Nesterov
@ 2016-10-24 12:10         ` Oleg Nesterov
  2016-10-24 12:22           ` Peter Zijlstra
  2016-10-24 12:19         ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Oleg Nesterov @ 2016-10-24 12:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On 10/24, Oleg Nesterov wrote:
>
> On 10/24, Peter Zijlstra wrote:
> >
> > On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> > >
> > > --- x/kernel/events/core.c
> > > +++ x/kernel/events/core.c
> > > @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> > >  	if (event->parent)
> > >  		event = event->parent;
> > >
> > > -	return task_tgid_nr_ns(p, event->ns);
> > > +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> > >  }
> >
> > Hurm.. should we not push this into task_tgid_nr_ns() ? I mean, now the
> > user needs to be aware of this dinky detail.
>
> Perhaps. Or into task_tgid(). Or even the patch below, __task_pid_nr_ns()
> is always safe. This certainly needs some cleanups.

the patch was obviously incomplete.

Oleg.


--- x/include/linux/pid.h
+++ x/include/linux/pid.h
@@ -8,7 +8,8 @@ enum pid_type
 	PIDTYPE_PID,
 	PIDTYPE_PGID,
 	PIDTYPE_SID,
-	PIDTYPE_MAX
+	PIDTYPE_MAX,
+	PIDTYPE_TGID	/* do not use */
 };
 
 /*
--- x/kernel/pid.c
+++ x/kernel/pid.c
@@ -526,8 +526,11 @@ pid_t __task_pid_nr_ns(struct task_struc
 	if (!ns)
 		ns = task_active_pid_ns(current);
 	if (likely(pid_alive(task))) {
-		if (type != PIDTYPE_PID)
+		if (type != PIDTYPE_PID) {
+			if (type == PIDTYPE_TGID)
+				type = PIDTYPE_PID;
 			task = task->group_leader;
+		}
 		nr = pid_nr_ns(rcu_dereference(task->pids[type].pid), ns);
 	}
 	rcu_read_unlock();
@@ -538,7 +541,7 @@ EXPORT_SYMBOL(__task_pid_nr_ns);
 
 pid_t task_tgid_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
 {
-	return pid_nr_ns(task_tgid(tsk), ns);
+	return __task_pid_nr_ns(tsk, PIDTYPE_TGID, ns);
 }
 EXPORT_SYMBOL(task_tgid_nr_ns);
 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 11:15   ` Oleg Nesterov
  2016-10-24 11:24     ` Peter Zijlstra
  2016-10-24 11:27     ` Peter Zijlstra
@ 2016-10-24 12:11     ` Peter Zijlstra
  2016-10-24 12:21       ` Oleg Nesterov
  2 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-24 12:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> Yes, current is still valid.
> 
> But nothing protects current->group_leader or parent/real_parent, they
> can point to the exited/freed task. We really need to nullify them in
> __unhash_process() to catch the problems like this, I wanted to do this
> many times...
> 
> So you simply can't know your tgid or even tid after release_task() calls
> __unhash_process(). Actually after exit_notify() unless the exiting task
> autoreaps itself.
> 
> How about the trivial fix below?
> 
> Oleg.
> 
> --- x/kernel/events/core.c
> +++ x/kernel/events/core.c
> @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
>  	if (event->parent)
>  		event = event->parent;
>  
> -	return task_tgid_nr_ns(p, event->ns);
> +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
>  }
>  
>  static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)

Should we do the same for perf_event_tid() and report -1 as the pid/tid
in the !alive case? -1 should be an obvious invalid pid since we limit
the pid-space to less than 32 bits.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 12:04         ` Jiri Olsa
@ 2016-10-24 12:12           ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-24 12:12 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Ni, BaoleX, mingo, acme, linux-kernel,
	alexander.shishkin, Liu, Chuansheng, Jiri Olsa

On Mon, Oct 24, 2016 at 02:04:11PM +0200, Jiri Olsa wrote:
> On Mon, Oct 24, 2016 at 01:29:45PM +0200, Peter Zijlstra wrote:

> > Hurm, then again, I imagine that after unhash_process the PID/TID could
> > be instantly re-used and then we're still confused.
> 
> sounds bad.. I haven't checked the related pid_alive code,
> but shouldn't we already get the EXIT event in this case?

It has, perf_event_exit_task() happens before we unhash.

But a per-cpu event that has PID/TID reporting on will run into this.

We'll observe 'funny' values between the unhash and the next context
switch.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 12:02       ` Oleg Nesterov
  2016-10-24 12:10         ` Oleg Nesterov
@ 2016-10-24 12:19         ` Peter Zijlstra
  1 sibling, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-24 12:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On Mon, Oct 24, 2016 at 02:02:32PM +0200, Oleg Nesterov wrote:
> Perhaps. Or into task_tgid(). Or even the patch below, __task_pid_nr_ns()
> is always safe. This certainly needs some cleanups.


> --- x/include/linux/pid.h
> +++ x/include/linux/pid.h
> @@ -8,7 +8,8 @@ enum pid_type
>  	PIDTYPE_PID,
>  	PIDTYPE_PGID,
>  	PIDTYPE_SID,
> -	PIDTYPE_MAX
> +	PIDTYPE_MAX,
> +	PIDTYPE_TGID	/* do not use */
>  };
>  
>  /*
> --- x/kernel/pid.c
> +++ x/kernel/pid.c
> @@ -538,7 +538,7 @@ EXPORT_SYMBOL(__task_pid_nr_ns);
>  
>  pid_t task_tgid_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
>  {
> -	return pid_nr_ns(task_tgid(tsk), ns);
> +	return __task_pid_nr_ns(tsk, PIDTYPE_TGID, ns);
>  }
>  EXPORT_SYMBOL(task_tgid_nr_ns);
>  
> 

Right, that will return 0 on !alive. But I'm not seeing how PIDTYPE_TGID
isn't an array bound violating of its own though. Then again, I didn't
look to hard at the pid stuff.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 12:11     ` Peter Zijlstra
@ 2016-10-24 12:21       ` Oleg Nesterov
  2016-10-24 12:27         ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Oleg Nesterov @ 2016-10-24 12:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On 10/24, Peter Zijlstra wrote:
>
> On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> >
> > --- x/kernel/events/core.c
> > +++ x/kernel/events/core.c
> > @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> >  	if (event->parent)
> >  		event = event->parent;
> >
> > -	return task_tgid_nr_ns(p, event->ns);
> > +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> >  }
> >
> >  static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
>
> Should we do the same for perf_event_tid() and report -1 as the pid/tid
> in the !alive case? -1 should be an obvious invalid pid since we limit
> the pid-space to less than 32 bits.

task_pid_nr_ns() is always safe, it calls __task_pid_nr_ns(). But yes,
it can return zero if called after exit_notify() and/or release_task().

And while zero is not a valid pid too, I guess it can be confused with
the idle thread's "pid" ?

Oleg.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 12:10         ` Oleg Nesterov
@ 2016-10-24 12:22           ` Peter Zijlstra
  2016-10-24 12:29             ` Oleg Nesterov
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-24 12:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On Mon, Oct 24, 2016 at 02:10:31PM +0200, Oleg Nesterov wrote:
> --- x/kernel/pid.c
> +++ x/kernel/pid.c
> @@ -526,8 +526,11 @@ pid_t __task_pid_nr_ns(struct task_struc
>  	if (!ns)
>  		ns = task_active_pid_ns(current);
>  	if (likely(pid_alive(task))) {
> -		if (type != PIDTYPE_PID)
> +		if (type != PIDTYPE_PID) {
> +			if (type == PIDTYPE_TGID)
> +				type = PIDTYPE_PID;
>  			task = task->group_leader;
> +		}

Aah, that makes much more sense ;-)

>  		nr = pid_nr_ns(rcu_dereference(task->pids[type].pid), ns);
>  	}
>  	rcu_read_unlock();


Still, I wonder if returning 0 is the right thing. 0 is a 'valid' PID
for the init/idle task.

And we still have the re-use issue for the TID, because when we get here
TID is already unhashed too afaict, it just doesn't explode because we
don't deref freed memory.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 12:21       ` Oleg Nesterov
@ 2016-10-24 12:27         ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-24 12:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On Mon, Oct 24, 2016 at 02:21:23PM +0200, Oleg Nesterov wrote:
> > Should we do the same for perf_event_tid() and report -1 as the pid/tid
> > in the !alive case? -1 should be an obvious invalid pid since we limit
> > the pid-space to less than 32 bits.
> 
> task_pid_nr_ns() is always safe, it calls __task_pid_nr_ns(). But yes,
> it can return zero if called after exit_notify() and/or release_task().
> 
> And while zero is not a valid pid too, I guess it can be confused with
> the idle thread's "pid" ?

Right, 0 is the idle thread.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 12:22           ` Peter Zijlstra
@ 2016-10-24 12:29             ` Oleg Nesterov
  2016-10-24 12:38               ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Oleg Nesterov @ 2016-10-24 12:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On 10/24, Peter Zijlstra wrote:
>
> On Mon, Oct 24, 2016 at 02:10:31PM +0200, Oleg Nesterov wrote:
> > --- x/kernel/pid.c
> > +++ x/kernel/pid.c
> > @@ -526,8 +526,11 @@ pid_t __task_pid_nr_ns(struct task_struc
> >  	if (!ns)
> >  		ns = task_active_pid_ns(current);
> >  	if (likely(pid_alive(task))) {
> > -		if (type != PIDTYPE_PID)
> > +		if (type != PIDTYPE_PID) {
> > +			if (type == PIDTYPE_TGID)
> > +				type = PIDTYPE_PID;
> >  			task = task->group_leader;
> > +		}
>
> Aah, that makes much more sense ;-)
>
> >  		nr = pid_nr_ns(rcu_dereference(task->pids[type].pid), ns);
> >  	}
> >  	rcu_read_unlock();
>
>
> Still, I wonder if returning 0 is the right thing. 0 is a 'valid' PID
> for the init/idle task.

Yes, now I think that -1 would make more sense. Unfortunately we can't
just change __task_pid_nr_ns(), it already has the users which assume
it returns zero... attach_to_pi_state() for example.

> And we still have the re-use issue for the TID, because when we get here
> TID is already unhashed too afaict,

Yes, so perf_event_tid() will report zero.

Oleg.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 12:29             ` Oleg Nesterov
@ 2016-10-24 12:38               ` Peter Zijlstra
  2016-10-24 13:25                 ` Oleg Nesterov
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-24 12:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On Mon, Oct 24, 2016 at 02:29:42PM +0200, Oleg Nesterov wrote:
> On 10/24, Peter Zijlstra wrote:
> >
> > On Mon, Oct 24, 2016 at 02:10:31PM +0200, Oleg Nesterov wrote:
> > > --- x/kernel/pid.c
> > > +++ x/kernel/pid.c
> > > @@ -526,8 +526,11 @@ pid_t __task_pid_nr_ns(struct task_struc
> > >  	if (!ns)
> > >  		ns = task_active_pid_ns(current);
> > >  	if (likely(pid_alive(task))) {
> > > -		if (type != PIDTYPE_PID)
> > > +		if (type != PIDTYPE_PID) {
> > > +			if (type == PIDTYPE_TGID)
> > > +				type = PIDTYPE_PID;
> > >  			task = task->group_leader;
> > > +		}
> >
> > Aah, that makes much more sense ;-)
> >
> > >  		nr = pid_nr_ns(rcu_dereference(task->pids[type].pid), ns);
> > >  	}
> > >  	rcu_read_unlock();
> >
> >
> > Still, I wonder if returning 0 is the right thing. 0 is a 'valid' PID
> > for the init/idle task.
> 
> Yes, now I think that -1 would make more sense. Unfortunately we can't
> just change __task_pid_nr_ns(), it already has the users which assume
> it returns zero... attach_to_pi_state() for example.

Indeed. And I have a patch that assumes task_pid_vnr(&init_task) == 0,
is that true because of this !alive case or true in general?

No worries though, we can revert to your earlier explicit test and
return -1 while adding a comment to explain details? I'll go write one
up in a bit, but I need to run an errand first.

> > And we still have the re-use issue for the TID, because when we get here
> > TID is already unhashed too afaict,
> 
> Yes, so perf_event_tid() will report zero.

Ah, ok. So whould we change that to match pid and return (explicit) -1
there too?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 12:38               ` Peter Zijlstra
@ 2016-10-24 13:25                 ` Oleg Nesterov
  2016-10-24 13:40                   ` Oleg Nesterov
  2016-10-24 14:36                   ` Peter Zijlstra
  0 siblings, 2 replies; 27+ messages in thread
From: Oleg Nesterov @ 2016-10-24 13:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On 10/24, Peter Zijlstra wrote:
>
> On Mon, Oct 24, 2016 at 02:29:42PM +0200, Oleg Nesterov wrote:
> > On 10/24, Peter Zijlstra wrote:
> > >
> > > Still, I wonder if returning 0 is the right thing. 0 is a 'valid' PID
> > > for the init/idle task.
> >
> > Yes, now I think that -1 would make more sense. Unfortunately we can't
> > just change __task_pid_nr_ns(), it already has the users which assume
> > it returns zero... attach_to_pi_state() for example.
>
> Indeed. And I have a patch that assumes task_pid_vnr(&init_task) == 0,
> is that true because of this !alive case or true in general?

This is true in general. Idle threads are always alive but they use the
the special init_struct_pid with .nr == 0.

> No worries though, we can revert to your earlier explicit test and
> return -1 while adding a comment to explain details?

...

> Ah, ok. So whould we change that to match pid and return (explicit) -1
> there too?

Well, if we add that PIDTYPE_TGID hack, I think we can do something
like below...

Or do you think we should add a perf_alive() check into perf_event_pid()
for a quick fix?

Either way it's a pity we can't report at least the valid tid, perhaps
perf_event_tid() could use task_pid_nr() if event->ns == init_pid_ns,
I dunno.

Oleg.

--- x/kernel/events/core.c
+++ x/kernel/events/core.c
@@ -1249,26 +1249,30 @@ unclone_ctx(struct perf_event_context *c
 	return parent_ctx;
 }
 
-static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
+static u32 perf_event_xxx(struct perf_event *event, struct task_struct *p,
+			 	enum pid_type type)
 {
+	pid_t nr;
 	/*
 	 * only top level events have the pid namespace they were created in
 	 */
 	if (event->parent)
 		event = event->parent;
 
-	return task_tgid_nr_ns(p, event->ns);
+	nr = __task_pid_nr_ns(p, type, event->ns);
+	if (!nr && !is_idle_task(p))
+		nr = -1;
+	return nr;
 }
 
-static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
+static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
 {
-	/*
-	 * only top level events have the pid namespace they were created in
-	 */
-	if (event->parent)
-		event = event->parent;
+	return perf_event_xxx(p, event, PIDTYPE_TGID);
+}
 
-	return task_pid_nr_ns(p, event->ns);
+static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
+{
+	return perf_event_xxx(p, event, PIDTYPE_PID);
 }
 
 /*

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 13:25                 ` Oleg Nesterov
@ 2016-10-24 13:40                   ` Oleg Nesterov
  2016-10-24 14:17                     ` Peter Zijlstra
  2016-10-24 14:36                   ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Oleg Nesterov @ 2016-10-24 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On 10/24, Oleg Nesterov wrote:
>
> -static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> +static u32 perf_event_xxx(struct perf_event *event, struct task_struct *p,
> +			 	enum pid_type type)
>  {
> +	pid_t nr;
>  	/*
>  	 * only top level events have the pid namespace they were created in
>  	 */
>  	if (event->parent)
>  		event = event->parent;
>
> -	return task_tgid_nr_ns(p, event->ns);
> +	nr = __task_pid_nr_ns(p, type, event->ns);
> +	if (!nr && !is_idle_task(p))
> +		nr = -1;
> +	return nr;

And just in case... In any case __task_pid_nr_ns() and other similar helpers
can also return zero if "p" runs in another namespace. Say, in the parent ns.

Say, perf_event_switch_output(). What do we want to report in this case, zero
or -1 ?

Oleg.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 13:40                   ` Oleg Nesterov
@ 2016-10-24 14:17                     ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-24 14:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On Mon, Oct 24, 2016 at 03:40:13PM +0200, Oleg Nesterov wrote:
> On 10/24, Oleg Nesterov wrote:
> >
> > -static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> > +static u32 perf_event_xxx(struct perf_event *event, struct task_struct *p,
> > +			 	enum pid_type type)
> >  {
> > +	pid_t nr;
> >  	/*
> >  	 * only top level events have the pid namespace they were created in
> >  	 */
> >  	if (event->parent)
> >  		event = event->parent;
> >
> > -	return task_tgid_nr_ns(p, event->ns);
> > +	nr = __task_pid_nr_ns(p, type, event->ns);
> > +	if (!nr && !is_idle_task(p))
> > +		nr = -1;
> > +	return nr;
> 
> And just in case... In any case __task_pid_nr_ns() and other similar helpers
> can also return zero if "p" runs in another namespace. Say, in the parent ns.

Right, I'm tempted to not change that. Its been the behaviour for a
while and changing that will upset people.

The unhash case is different in that its actively broken so we must do
something.

> Say, perf_event_switch_output(). What do we want to report in this case, zero
> or -1 ?

As with all switch_output() cases, the user had better know wth he's
doing ;-) Doing a switch_output() on a running counter is dubious to
begin with.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 13:25                 ` Oleg Nesterov
  2016-10-24 13:40                   ` Oleg Nesterov
@ 2016-10-24 14:36                   ` Peter Zijlstra
  2016-10-24 15:39                     ` Oleg Nesterov
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-24 14:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On Mon, Oct 24, 2016 at 03:25:55PM +0200, Oleg Nesterov wrote:
> Well, if we add that PIDTYPE_TGID hack, I think we can do something
> like below...
> 
> Or do you think we should add a perf_alive() check into perf_event_pid()
> for a quick fix?

That is what I was thinking. Then we don't need to do the TGID hack,
I suspect some people might object to that.

> Either way it's a pity we can't report at least the valid tid, perhaps
> perf_event_tid() could use task_pid_nr() if event->ns == init_pid_ns,
> I dunno.

Right, but after unhash is there really still the notion of a valid TID?
I mean, the TID can be reused, at which point you'll end up with two
tasks etc..

But yes, very tedious.

I was thinking something like so?

---

 kernel/events/core.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c6e47e97b33f..2c9a22485e9e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1257,7 +1257,14 @@ static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
 	if (event->parent)
 		event = event->parent;
 
-	return task_tgid_nr_ns(p, event->ns);
+	/*
+	 * It is possible the task already got unhashed, in which case we
+	 * cannot determine the current->group_leader/real_parent.
+	 *
+	 * Also, report -1 to indicate unhashed, so as not to confused with
+	 * 0 for the idle task.
+	 */
+	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : ~0;
 }
 
 static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
@@ -1268,7 +1275,7 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
 	if (event->parent)
 		event = event->parent;
 
-	return task_pid_nr_ns(p, event->ns);
+	return pid_alive(p) ? task_pid_nr_ns(p, event->ns) : ~0;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 14:36                   ` Peter Zijlstra
@ 2016-10-24 15:39                     ` Oleg Nesterov
  2016-10-24 15:53                       ` Oleg Nesterov
  2016-10-25  9:28                       ` Peter Zijlstra
  0 siblings, 2 replies; 27+ messages in thread
From: Oleg Nesterov @ 2016-10-24 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On 10/24, Peter Zijlstra wrote:
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1257,7 +1257,14 @@ static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
>  	if (event->parent)
>  		event = event->parent;
>  
> -	return task_tgid_nr_ns(p, event->ns);
> +	/*
> +	 * It is possible the task already got unhashed, in which case we
> +	 * cannot determine the current->group_leader/real_parent.
> +	 *
> +	 * Also, report -1 to indicate unhashed, so as not to confused with
> +	 * 0 for the idle task.
> +	 */
> +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : ~0;
>  }

Yes, but this _looks_ racy unless p == current. I mean, pid_alive() makes
task_tgid_nr_ns() safe, but task_tgid_nr_ns() still can return zero _if_
it can race with the exiting task.

>  static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> @@ -1268,7 +1275,7 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
>  	if (event->parent)
>  		event = event->parent;
>
> -	return task_pid_nr_ns(p, event->ns);
> +	return pid_alive(p) ? task_pid_nr_ns(p, event->ns) : ~0;

The same.

However. At first glance the only case when p != current is copy_process(),
right? And in this case the new child can't go away. So I think this patch
is fine.

Oleg.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 15:39                     ` Oleg Nesterov
@ 2016-10-24 15:53                       ` Oleg Nesterov
  2016-10-25  6:55                         ` Ni, BaoleX
  2016-10-25  9:28                       ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Oleg Nesterov @ 2016-10-24 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On 10/24, Oleg Nesterov wrote:
>
> On 10/24, Peter Zijlstra wrote:
> >
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1257,7 +1257,14 @@ static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> >  	if (event->parent)
> >  		event = event->parent;
> >
> > -	return task_tgid_nr_ns(p, event->ns);
> > +	/*
> > +	 * It is possible the task already got unhashed, in which case we
> > +	 * cannot determine the current->group_leader/real_parent.
> > +	 *
> > +	 * Also, report -1 to indicate unhashed, so as not to confused with
> > +	 * 0 for the idle task.
> > +	 */
> > +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : ~0;
> >  }
>
> Yes, but this _looks_ racy unless p == current. I mean, pid_alive() makes
> task_tgid_nr_ns() safe, but task_tgid_nr_ns() still can return zero _if_
> it can race with the exiting task.
>
> >  static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> > @@ -1268,7 +1275,7 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> >  	if (event->parent)
> >  		event = event->parent;
> >
> > -	return task_pid_nr_ns(p, event->ns);
> > +	return pid_alive(p) ? task_pid_nr_ns(p, event->ns) : ~0;
>
> The same.
>
> However. At first glance the only case when p != current is copy_process(),
> right? And in this case the new child can't go away. So I think this patch
> is fine.

Actually there is another case, comm_write() -> perf_event_comm_output(). It
checks same_thread_group(current, p), so we can only race with the exiting
sub-thread. perf_event_pid() can't return zero, perf_event_tid() can.

And I personally think we do not care and your patch is fine anyway ;)

Oleg.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: hit a KASan bug related to Perf during stress test
  2016-10-24 15:53                       ` Oleg Nesterov
@ 2016-10-25  6:55                         ` Ni, BaoleX
  0 siblings, 0 replies; 27+ messages in thread
From: Ni, BaoleX @ 2016-10-25  6:55 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: mingo, acme, linux-kernel, alexander.shishkin, Liu, Chuansheng

Thanks a lot, guys.
I will take Peter's patch to do stress test.

-----Original Message-----
From: Oleg Nesterov [mailto:oleg@redhat.com] 
Sent: Monday, October 24, 2016 11:53 PM
To: Peter Zijlstra
Cc: Ni, BaoleX; mingo@redhat.com; acme@kernel.org; linux-kernel@vger.kernel.org; alexander.shishkin@linux.intel.com; Liu, Chuansheng
Subject: Re: hit a KASan bug related to Perf during stress test

On 10/24, Oleg Nesterov wrote:
>
> On 10/24, Peter Zijlstra wrote:
> >
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1257,7 +1257,14 @@ static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> >  	if (event->parent)
> >  		event = event->parent;
> >
> > -	return task_tgid_nr_ns(p, event->ns);
> > +	/*
> > +	 * It is possible the task already got unhashed, in which case we
> > +	 * cannot determine the current->group_leader/real_parent.
> > +	 *
> > +	 * Also, report -1 to indicate unhashed, so as not to confused with
> > +	 * 0 for the idle task.
> > +	 */
> > +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : ~0;
> >  }
>
> Yes, but this _looks_ racy unless p == current. I mean, pid_alive() 
> makes
> task_tgid_nr_ns() safe, but task_tgid_nr_ns() still can return zero 
> _if_ it can race with the exiting task.
>
> >  static u32 perf_event_tid(struct perf_event *event, struct 
> > task_struct *p) @@ -1268,7 +1275,7 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> >  	if (event->parent)
> >  		event = event->parent;
> >
> > -	return task_pid_nr_ns(p, event->ns);
> > +	return pid_alive(p) ? task_pid_nr_ns(p, event->ns) : ~0;
>
> The same.
>
> However. At first glance the only case when p != current is 
> copy_process(), right? And in this case the new child can't go away. 
> So I think this patch is fine.

Actually there is another case, comm_write() -> perf_event_comm_output(). It checks same_thread_group(current, p), so we can only race with the exiting sub-thread. perf_event_pid() can't return zero, perf_event_tid() can.

And I personally think we do not care and your patch is fine anyway ;)

Oleg.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-24 15:39                     ` Oleg Nesterov
  2016-10-24 15:53                       ` Oleg Nesterov
@ 2016-10-25  9:28                       ` Peter Zijlstra
  2016-10-25 14:41                         ` Oleg Nesterov
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-25  9:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On Mon, Oct 24, 2016 at 05:39:08PM +0200, Oleg Nesterov wrote:
> On 10/24, Peter Zijlstra wrote:
> >
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1257,7 +1257,14 @@ static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> >  	if (event->parent)
> >  		event = event->parent;
> >  
> > -	return task_tgid_nr_ns(p, event->ns);
> > +	/*
> > +	 * It is possible the task already got unhashed, in which case we
> > +	 * cannot determine the current->group_leader/real_parent.
> > +	 *
> > +	 * Also, report -1 to indicate unhashed, so as not to confused with
> > +	 * 0 for the idle task.
> > +	 */
> > +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : ~0;
> >  }
> 
> Yes, but this _looks_ racy unless p == current. I mean, pid_alive() makes
> task_tgid_nr_ns() safe, but task_tgid_nr_ns() still can return zero _if_
> it can race with the exiting task.

So what serialization would close that race? __task_pid_nr_ns() only
seems to use RCU nothing more.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-25  9:28                       ` Peter Zijlstra
@ 2016-10-25 14:41                         ` Oleg Nesterov
  2016-10-26  9:03                           ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Oleg Nesterov @ 2016-10-25 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On 10/25, Peter Zijlstra wrote:
>
> On Mon, Oct 24, 2016 at 05:39:08PM +0200, Oleg Nesterov wrote:
> > On 10/24, Peter Zijlstra wrote:
> > >
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -1257,7 +1257,14 @@ static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> > >  	if (event->parent)
> > >  		event = event->parent;
> > >
> > > -	return task_tgid_nr_ns(p, event->ns);
> > > +	/*
> > > +	 * It is possible the task already got unhashed, in which case we
> > > +	 * cannot determine the current->group_leader/real_parent.
> > > +	 *
> > > +	 * Also, report -1 to indicate unhashed, so as not to confused with
> > > +	 * 0 for the idle task.
> > > +	 */
> > > +	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : ~0;
> > >  }
> >
> > Yes, but this _looks_ racy unless p == current. I mean, pid_alive() makes
> > task_tgid_nr_ns() safe, but task_tgid_nr_ns() still can return zero _if_
> > it can race with the exiting task.
>
> So what serialization would close that race? __task_pid_nr_ns() only
> seems to use RCU nothing more.

I do not see how can we close this race, we obviously do not want to use
any locking.

That is why I tried to suggest

	nr = __task_pid_nr_ns(p, type, event->ns);
	if (!nr && !is_idle_task(p))
		nr = -1;
	return nr;

but this will report -1 if p runs in another namespace, so perhaps we
can do

	nr = __task_pid_nr_ns(p, type, event->ns);
	if (!nr && p->exit_state)
		// it has already called exit_notify
		nr = -1;
	return nr;

Oleg.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-25 14:41                         ` Oleg Nesterov
@ 2016-10-26  9:03                           ` Peter Zijlstra
  2016-10-26 16:10                             ` Oleg Nesterov
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-10-26  9:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On Tue, Oct 25, 2016 at 04:41:26PM +0200, Oleg Nesterov wrote:
> >
> > So what serialization would close that race? __task_pid_nr_ns() only
> > seems to use RCU nothing more.
> 
> I do not see how can we close this race, we obviously do not want to use
> any locking.
> 
> That is why I tried to suggest
> 
> 	nr = __task_pid_nr_ns(p, type, event->ns);
> 	if (!nr && !is_idle_task(p))
> 		nr = -1;
> 	return nr;
> 
> but this will report -1 if p runs in another namespace, so perhaps we
> can do
> 
> 	nr = __task_pid_nr_ns(p, type, event->ns);
> 	if (!nr && p->exit_state)
> 		// it has already called exit_notify
> 		nr = -1;
> 	return nr;

I think I'm asking how __task_pid_nr_ns() isn't susceptible to this race
;-)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: hit a KASan bug related to Perf during stress test
  2016-10-26  9:03                           ` Peter Zijlstra
@ 2016-10-26 16:10                             ` Oleg Nesterov
  0 siblings, 0 replies; 27+ messages in thread
From: Oleg Nesterov @ 2016-10-26 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ni, BaoleX, mingo, acme, linux-kernel, alexander.shishkin, Liu,
	Chuansheng

On 10/26, Peter Zijlstra wrote:
>
> On Tue, Oct 25, 2016 at 04:41:26PM +0200, Oleg Nesterov wrote:
> > >
> > > So what serialization would close that race? __task_pid_nr_ns() only
> > > seems to use RCU nothing more.
> >
> > I do not see how can we close this race, we obviously do not want to use
> > any locking.
> >
> > That is why I tried to suggest
> >
> > 	nr = __task_pid_nr_ns(p, type, event->ns);
> > 	if (!nr && !is_idle_task(p))
> > 		nr = -1;
> > 	return nr;
> >
> > but this will report -1 if p runs in another namespace, so perhaps we
> > can do
> >
> > 	nr = __task_pid_nr_ns(p, type, event->ns);
> > 	if (!nr && p->exit_state)
> > 		// it has already called exit_notify
> > 		nr = -1;
> > 	return nr;
>
> I think I'm asking how __task_pid_nr_ns() isn't susceptible to this race
> ;-)

which race ? ;) it seems that I confused you. Lets ignore the original
problem with perf_event_pid()->task_tgid_nr_ns() which can access the
freed memory. Lets suppose it is already fixed.

Another problem, as you noted, is that task_tgid_nr_ns/task_pid_nr_ns
returns zero if the task exits and this zero can be confused with the
swapper's pid.

	return pid_alive(p) ? task_pid_nr_ns(p, event->ns) : ~0

still can return zero because pid_alive(p) == T is not stable if we can
race with the exiting task, so it can't guarantee that task_pid_nr_ns()
won't return 0.

So we can check ->exit_state or, even better, that same pid_alive() after
task_pid_nr_ns() returns 0.

	nr = task_pid_nr_ns(p);
	/* avoid -1 if it is idle thread or runs in another ns */
	if (!nr && !pid_alive(p))
		nr = -1;
	return nr;

Or I misunderstood you?

Oleg.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2016-10-26 16:12 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <318B87A793BE164187D8851D6CE09D64371C8811@shsmsx102.ccr.corp.intel.com>
2016-10-24  9:53 ` hit a KASan bug related to Perf during stress test Peter Zijlstra
2016-10-24 11:15   ` Oleg Nesterov
2016-10-24 11:24     ` Peter Zijlstra
2016-10-24 12:02       ` Oleg Nesterov
2016-10-24 12:10         ` Oleg Nesterov
2016-10-24 12:22           ` Peter Zijlstra
2016-10-24 12:29             ` Oleg Nesterov
2016-10-24 12:38               ` Peter Zijlstra
2016-10-24 13:25                 ` Oleg Nesterov
2016-10-24 13:40                   ` Oleg Nesterov
2016-10-24 14:17                     ` Peter Zijlstra
2016-10-24 14:36                   ` Peter Zijlstra
2016-10-24 15:39                     ` Oleg Nesterov
2016-10-24 15:53                       ` Oleg Nesterov
2016-10-25  6:55                         ` Ni, BaoleX
2016-10-25  9:28                       ` Peter Zijlstra
2016-10-25 14:41                         ` Oleg Nesterov
2016-10-26  9:03                           ` Peter Zijlstra
2016-10-26 16:10                             ` Oleg Nesterov
2016-10-24 12:19         ` Peter Zijlstra
2016-10-24 11:27     ` Peter Zijlstra
2016-10-24 11:29       ` Peter Zijlstra
2016-10-24 12:04         ` Jiri Olsa
2016-10-24 12:12           ` Peter Zijlstra
2016-10-24 12:11     ` Peter Zijlstra
2016-10-24 12:21       ` Oleg Nesterov
2016-10-24 12:27         ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).