From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752937AbcLLNql (ORCPT ); Mon, 12 Dec 2016 08:46:41 -0500 Received: from mx1.redhat.com ([209.132.183.28]:43016 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750712AbcLLNqk (ORCPT ); Mon, 12 Dec 2016 08:46:40 -0500 Date: Mon, 12 Dec 2016 14:46:11 +0100 From: Oleg Nesterov To: "Eric W. Biederman" Cc: EunTaik Lee , "mingo@redhat.com" , "peterz@infradead.org" , "linux-kernel@vger.kernel.org" Subject: Re: [PATCH] sched/pid fix use-after free in task_tgid_vnr Message-ID: <20161212134611.GA11078@redhat.com> References: <20161209093351epcms1p418673c3cdec7d4c3e81b5df131173c57@epcms1p4> <20161209172114.GA25742@redhat.com> <87wpf8hhfc.fsf@xmission.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87wpf8hhfc.fsf@xmission.com> User-Agent: Mutt/1.5.18 (2008-05-17) X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.31]); Mon, 12 Dec 2016 13:46:39 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 12/10, Eric W. Biederman wrote: > > Oleg Nesterov writes: > > > On 12/09, EunTaik Lee wrote: > >> > >> There is a use-after-free case with below call stack. > >> > >> pid_nr_ns+0x10/0x38 > >> cgroup_pidlist_start+0x144/0x400 > >> cgroup_seqfile_start+0x1c/0x24 > >> kernfs_seq_start+0x54/0x90 > >> seq_read+0x15c/0x3a8 > >> kernfs_fop_read+0x38/0x160 > >> __vfs_read+0x28/0xc8 > >> vfs_read+0x84/0xfc > > How is this a use after free. The function pid_nr_ns should take a NULL pointer > as input and return 0? No, the task (task_struct) itself can't go away, but task->group_leader can point to nowhere. > Certainly if the addtion of pid_alive fixes it pid_vnr(task_tgid(tsk)) > is fine. Are we perhaps missing rcu locking? rcu_read_lock() is not enough in this case, see below. > Or is the problem simply that in task_tgid we are accessing > task->group_leader which may already be dead? Yes. Lets forget about the callchain above, I didn't even bother to verify that it can actually hit the problem. Although I think EunTaik is very right, css_task_iter_next() does get_task_struct() and drops css_set_lock, the task can exit after that. Forget. Just suppose that a task simply does pid = task_tgid_vnr(current); after it has already called exit_notify(). And this is what perf_event_pid() does, perhaps we have more buggy users. In this case current->group_leader or parent/real_parent can point to the exited/freed tasks. I already said this many times, ee really need to nullify them in __unhash_process() but this needs a lot of (mostly simple) cleanups. > If so the fix needs to be > in task_tgid. Yes, task_tgid() should probably return NULL in this case, but this connects to "a lot of cleanups" above. > > --- x/include/linux/pid.h > > +++ x/include/linux/pid.h > > @@ -8,7 +8,8 @@ enum pid_type > > PIDTYPE_PID, > > PIDTYPE_PGID, > > PIDTYPE_SID, > > - PIDTYPE_MAX > > + PIDTYPE_MAX, > > + PIDTYPE_TGID /* do not use */ > > > I would do: > > /* __PIDTYPE_TGID is only valid to __task_pid_nr_ns */ > #define __PIDTYPE_TGID PIDTYPE_MAX > > Prefixing __PIDTYPE_TGID with __ should help make it clear > this is a special use define. OK, will do, thanks. > I am also curious why pid_alive is the proper check to see if > task->group_leader is valid? That feels like it could get us into > trouble later. It is. pid_alive(task) == T means that this task was not removed from rcu-protected-lists and thus task->group_leader (in particular) can't go away. As long as you check pid_alive() and use ->group_leader in the same rcu_read_lock section, of course. To clarify, this is not because detach_pid(PIDTYPE_PID) is called before list_del_rcu(), this doesn't matter. What does matter is that if you see pid_alive() == T (or task->sighand != NULL, or anything else which means that release_task/__unhash_process was not called yet) you know that this task, its leader/parent/real_parent/pids/etc can't go away until rcu_read_unlock(). And this is why __task_pid_nr_ns() checks pid_alive() before it reads ->group_leader. Note that __task_pid_nr_ns(PIDTYPE_PID) does not need this check, task->pids[PID] is nullified by detach_pid() and pid_nr_ns() check pid != NULL. However. I think it should be renamed. Or, better, we should add a new helper to make this all more clear. Say, bool task_is_rcu_safe(task) { return task->sighand != NULL; } then change __task_pid_nr_ns() to use this helper rather than pid_alive(). Because, once again, it is not that we need to ensure that pids[PIDTYPE_PID].pid != NULL, we need to ensure that rcu_read_lock() can actually protect this task and its leader/parent/etc. And of course, it can have more users. Oleg.