* [PATCH 0/3] proc: first_tid() fix/cleanup
@ 2013-05-27 20:27 Oleg Nesterov
  2013-05-27 20:28 ` [PATCH 1/3] proc: first_tid: fix the potential use-after-free Oleg Nesterov
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Oleg Nesterov @ 2013-05-27 20:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Eric W. Biederman, KAMEZAWA Hiroyuki,
	Michal Hocko, Sergey Dyasly, Sha Zhengju, linux-kernel

Hello.

next_thread() should be avoided; probably next_tid() is the
only "valid" user.

But now we have another reason to avoid (and probably even kill)
it: we are going to replace or fix while_each_thread(), since
almost every lockless usage is wrong.

I was going to send more changes, but this initial series nearly
killed me. And I think first_tid() needs another cleanup, the
->f_pos truncation doesn't look nice; more tomorrow.

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/3] proc: first_tid: fix the potential use-after-free
  2013-05-27 20:27 [PATCH 0/3] proc: first_tid() fix/cleanup Oleg Nesterov
@ 2013-05-27 20:28 ` Oleg Nesterov
  2013-05-29  4:08   ` Eric W. Biederman
  2013-05-27 20:28 ` [PATCH 2/3] proc: change first_tid() to use while_each_thread() Oleg Nesterov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2013-05-27 20:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Eric W. Biederman, KAMEZAWA Hiroyuki,
	Michal Hocko, Sergey Dyasly, Sha Zhengju, linux-kernel

proc_task_readdir() verifies that the result of get_proc_task()
is pid_alive() and thus its ->group_leader is fine too. However,
this is not necessarily true after rcu_read_unlock(); we need
to recheck this after first_tid() does rcu_read_lock() again.

The race is subtle and unlikely, but it is still possible afaics.
To simplify, let's ignore the "likely" case when tid != 0; f_version
can be cleared by proc_task_operations->llseek().

Suppose we have a main thread M and its subthread T. Suppose that
f_pos == 3, iow first_tid() should return T. Now suppose that the
following happens between rcu_read_unlock() and rcu_read_lock():

	1. T execs and becomes the new leader. This removes M from
	    ->thread_group but next_thread(M) is still T.

	2. T creates another thread X which does exec as well, T
	   goes away.

	3. X creates another subthread, this increments nr_threads.

	4. first_tid() does next_thread(M) and returns the already
	   dead T.

Note that we need 2. and 3. only because of the get_nr_threads()
check, and this check was supposed to be an optimization only.

Note: I think that proc_task_readdir/first_tid interaction can be
simplified, but this needs another patch. proc_task_readdir() should
not play with ->group_leader at all. See the next patches.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/base.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index dd51e50..c939c9f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3186,10 +3186,13 @@ static struct task_struct *first_tid(struct task_struct *leader,
 			goto found;
 	}
 
-	/* If nr exceeds the number of threads there is nothing todo */
 	pos = NULL;
+	/* If nr exceeds the number of threads there is nothing todo */
 	if (nr && nr >= get_nr_threads(leader))
 		goto out;
+	/* It could be unhashed before we take rcu lock */
+	if (!pid_alive(leader))
+		goto out;
 
 	/* If we haven't found our starting place yet start
 	 * with the leader and walk nr threads forward.
-- 
1.5.5.1



* [PATCH 2/3] proc: change first_tid() to use while_each_thread()
  2013-05-27 20:27 [PATCH 0/3] proc: first_tid() fix/cleanup Oleg Nesterov
  2013-05-27 20:28 ` [PATCH 1/3] proc: first_tid: fix the potential use-after-free Oleg Nesterov
@ 2013-05-27 20:28 ` Oleg Nesterov
  2013-05-27 20:28 ` [PATCH 3/3] proc: simplify proc_task_readdir/first_tid paths Oleg Nesterov
  2013-05-29  5:22 ` [PATCH 0/3] proc: first_tid() fix/cleanup Eric W. Biederman
  3 siblings, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2013-05-27 20:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Eric W. Biederman, KAMEZAWA Hiroyuki,
	Michal Hocko, Sergey Dyasly, Sha Zhengju, linux-kernel

Change first_tid() to use while_each_thread() instead of
next_thread().

This doesn't make the code simpler/better, but we are going
to fix or replace while_each_thread(), next_thread() should
be avoided whenever possible.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/base.c |   20 ++++++++++----------
 1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index c939c9f..bed1096 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3186,24 +3186,24 @@ static struct task_struct *first_tid(struct task_struct *leader,
 			goto found;
 	}
 
-	pos = NULL;
 	/* If nr exceeds the number of threads there is nothing todo */
 	if (nr && nr >= get_nr_threads(leader))
-		goto out;
+		goto fail;
 	/* It could be unhashed before we take rcu lock */
 	if (!pid_alive(leader))
-		goto out;
+		goto fail;
 
 	/* If we haven't found our starting place yet start
 	 * with the leader and walk nr threads forward.
 	 */
-	for (pos = leader; nr > 0; --nr) {
-		pos = next_thread(pos);
-		if (pos == leader) {
-			pos = NULL;
-			goto out;
-		}
-	}
+	pos = leader;
+	do {
+		if (nr-- <= 0)
+			goto found;
+	} while_each_thread(leader, pos);
+fail:
+	pos = NULL;
+	goto out;
 found:
 	get_task_struct(pos);
 out:
-- 
1.5.5.1



* [PATCH 3/3] proc: simplify proc_task_readdir/first_tid paths
  2013-05-27 20:27 [PATCH 0/3] proc: first_tid() fix/cleanup Oleg Nesterov
  2013-05-27 20:28 ` [PATCH 1/3] proc: first_tid: fix the potential use-after-free Oleg Nesterov
  2013-05-27 20:28 ` [PATCH 2/3] proc: change first_tid() to use while_each_thread() Oleg Nesterov
@ 2013-05-27 20:28 ` Oleg Nesterov
  2013-05-29  4:42   ` Eric W. Biederman
  2013-05-29  5:22 ` [PATCH 0/3] proc: first_tid() fix/cleanup Eric W. Biederman
  3 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2013-05-27 20:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Eric W. Biederman, KAMEZAWA Hiroyuki,
	Michal Hocko, Sergey Dyasly, Sha Zhengju, linux-kernel

proc_task_readdir() does not really need "leader", first_tid()
has to revalidate it anyway. Just pass proc_pid(inode) to
first_tid() instead, it can do pid_task(PIDTYPE_PID) itself
and read ->group_leader only if necessary.

Note: I am not sure proc_task_readdir() really needs the initial
-ENOENT check, but this is what the current code does.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/base.c |   46 ++++++++++++++++------------------------------
 1 files changed, 16 insertions(+), 30 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index bed1096..dbc4dae 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3173,34 +3173,35 @@ out_no_task:
  * In the case of a seek we start with the leader and walk nr
  * threads past it.
  */
-static struct task_struct *first_tid(struct task_struct *leader,
-		int tid, int nr, struct pid_namespace *ns)
+static struct task_struct *first_tid(struct pid *pid, int tid,
+					int nr, struct pid_namespace *ns)
 {
-	struct task_struct *pos;
+	struct task_struct *pos, *task;
 
 	rcu_read_lock();
-	/* Attempt to start with the pid of a thread */
+	task = pid_task(pid, PIDTYPE_PID);
+	if (!task)
+		goto fail;
+
+	/* Attempt to start with the tid of a thread */
 	if (tid && (nr > 0)) {
 		pos = find_task_by_pid_ns(tid, ns);
-		if (pos && (pos->group_leader == leader))
+		if (pos && same_thread_group(pos, task))
 			goto found;
 	}
 
 	/* If nr exceeds the number of threads there is nothing todo */
-	if (nr && nr >= get_nr_threads(leader))
-		goto fail;
-	/* It could be unhashed before we take rcu lock */
-	if (!pid_alive(leader))
+	if (nr && nr >= get_nr_threads(task))
 		goto fail;
 
 	/* If we haven't found our starting place yet start
 	 * with the leader and walk nr threads forward.
 	 */
-	pos = leader;
+	pos = task = task->group_leader;
 	do {
 		if (nr-- <= 0)
 			goto found;
-	} while_each_thread(leader, pos);
+	} while_each_thread(task, pos);
 fail:
 	pos = NULL;
 	goto out;
@@ -3247,26 +3248,13 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
 {
 	struct dentry *dentry = filp->f_path.dentry;
 	struct inode *inode = dentry->d_inode;
-	struct task_struct *leader = NULL;
 	struct task_struct *task;
-	int retval = -ENOENT;
 	ino_t ino;
 	int tid;
 	struct pid_namespace *ns;
 
-	task = get_proc_task(inode);
-	if (!task)
-		goto out_no_task;
-	rcu_read_lock();
-	if (pid_alive(task)) {
-		leader = task->group_leader;
-		get_task_struct(leader);
-	}
-	rcu_read_unlock();
-	put_task_struct(task);
-	if (!leader)
-		goto out_no_task;
-	retval = 0;
+	if (!pid_task(proc_pid(inode), PIDTYPE_PID))
+		return -ENOENT;
 
 	switch ((unsigned long)filp->f_pos) {
 	case 0:
@@ -3289,7 +3277,7 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
 	ns = filp->f_dentry->d_sb->s_fs_info;
 	tid = (int)filp->f_version;
 	filp->f_version = 0;
-	for (task = first_tid(leader, tid, filp->f_pos - 2, ns);
+	for (task = first_tid(proc_pid(inode), tid, filp->f_pos - 2, ns);
 	     task;
 	     task = next_tid(task), filp->f_pos++) {
 		tid = task_pid_nr_ns(task, ns);
@@ -3302,9 +3290,7 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
 		}
 	}
 out:
-	put_task_struct(leader);
-out_no_task:
-	return retval;
+	return 0;
 }
 
 static int proc_task_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
-- 
1.5.5.1



* Re: [PATCH 1/3] proc: first_tid: fix the potential use-after-free
  2013-05-27 20:28 ` [PATCH 1/3] proc: first_tid: fix the potential use-after-free Oleg Nesterov
@ 2013-05-29  4:08   ` Eric W. Biederman
  2013-05-29 12:30     ` Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Eric W. Biederman @ 2013-05-29  4:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki, Michal Hocko,
	Sergey Dyasly, Sha Zhengju, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> proc_task_readdir() verifies that the result of get_proc_task()
> is pid_alive() and thus its ->group_leader is fine too. However
> this is not necessarily true after rcu_read_unlock(), we need
> to recheck this after first_tid() does rcu_read_lock() again.

I agree with you but you are missing something critical from your
explanation.  If a process has been passed through __unhash_process
then task->thread_group.next (aka next_thread) returns a pointer to the
process that was its next thread in the thread group.  Importantly
that pointer is only guaranteed to point to valid memory until the rcu
grace period expires.

Which means that starting a walk of a thread list with a task that
could have been unhashed before the current rcu critical section
began is invalid, and can lead to following an invalid pointer.

> The race is subtle and unlikely, but still it is possible afaics.
> To simplify lets ignore the "likely" case when tid != 0, f_version
> can be cleared by proc_task_operations->llseek().
>
> Suppose we have a main thread M and its subthread T. Suppose that
> f_pos == 3, iow first_tid() should return T. Now suppose that the
> following happens between rcu_read_unlock() and rcu_read_lock():
>
> 	1. T execs and becomes the new leader. This removes M from
> 	    ->thread_group but next_thread(M) is still T.
>
> 	2. T creates another thread X which does exec as well, T
> 	   goes away.
>
> 	3. X creates another subthread, this increments nr_threads.
>
> 	4. first_tid() does next_thread(M) and returns the already
> 	   dead T.
>
> Note that we need 2. and 3. only because of get_nr_threads() check,
> and this check was supposed to be optimization only.

An optimization and denial of service attack prevention.  It keeps us
from spinning for nearly unbounded amounts of time in the rcu critical
section.  But I agree it should not be needed for correctness.

> Note: I think that proc_task_readdir/first_tid interaction can be
> simplified, but this needs another patch. proc_task_readdir() should
> not play with ->group_leader at all. See the next patches.

That sounds right.  I seem to recall that there was a purpose in keeping
the leader pinned but it looks like that purpose is long since gone.

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  fs/proc/base.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index dd51e50..c939c9f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3186,10 +3186,13 @@ static struct task_struct *first_tid(struct task_struct *leader,
>  			goto found;
>  	}
>  
> -	/* If nr exceeds the number of threads there is nothing todo */
>  	pos = NULL;
> +	/* If nr exceeds the number of threads there is nothing todo */

Moving the comment is just noise and makes for confusing reading of your
patch.

>  	if (nr && nr >= get_nr_threads(leader))
>  		goto out;
> +	/* It could be unhashed before we take rcu lock */
> +	if (!pid_alive(leader))
> +		goto out;
>  
>  	/* If we haven't found our starting place yet start
>  	 * with the leader and walk nr threads forward.


* Re: [PATCH 3/3] proc: simplify proc_task_readdir/first_tid paths
  2013-05-27 20:28 ` [PATCH 3/3] proc: simplify proc_task_readdir/first_tid paths Oleg Nesterov
@ 2013-05-29  4:42   ` Eric W. Biederman
  2013-05-29 13:39     ` Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Eric W. Biederman @ 2013-05-29  4:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki, Michal Hocko,
	Sergey Dyasly, Sha Zhengju, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> proc_task_readdir() does not really need "leader", first_tid()
> has to revalidate it anyway. Just pass proc_pid(inode) to
> first_tid() instead, it can do pid_task(PIDTYPE_PID) itself
> and read ->group_leader only if necessary.
>
> Note: I am not sure proc_task_readdir() really needs the initial
> -ENOENT check, but this is what the current code does.

This looks like a nice cleanup.

We would need either -ENOENT or a return of 0 and an empty directory at
the least.  We need the check so that empty directories don't have "."
and ".." entries.

Eric


> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  fs/proc/base.c |   46 ++++++++++++++++------------------------------
>  1 files changed, 16 insertions(+), 30 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index bed1096..dbc4dae 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3173,34 +3173,35 @@ out_no_task:
>   * In the case of a seek we start with the leader and walk nr
>   * threads past it.
>   */
> -static struct task_struct *first_tid(struct task_struct *leader,
> -		int tid, int nr, struct pid_namespace *ns)
> +static struct task_struct *first_tid(struct pid *pid, int tid,
> +					int nr, struct pid_namespace *ns)
>  {
> -	struct task_struct *pos;
> +	struct task_struct *pos, *task;
>  
>  	rcu_read_lock();
> -	/* Attempt to start with the pid of a thread */
> +	task = pid_task(pid, PIDTYPE_PID);
Elegant.  If we can get the task from the pid, the task is hashed
and still alive while we are in the rcu critical section.

> +	if (!task)
> +		goto fail;
> +
> +	/* Attempt to start with the tid of a thread */
>  	if (tid && (nr > 0)) {
>  		pos = find_task_by_pid_ns(tid, ns);
> -		if (pos && (pos->group_leader == leader))
> +		if (pos && same_thread_group(pos, task))

Sigh this reminds me we need to figure out how to kill task->pid and
task->tgid, which I assume means fixing same_thread_group.

>  			goto found;
>  	}
>  
>  	/* If nr exceeds the number of threads there is nothing todo */
> -	if (nr && nr >= get_nr_threads(leader))
> -		goto fail;
> -	/* It could be unhashed before we take rcu lock */
> -	if (!pid_alive(leader))
> +	if (nr && nr >= get_nr_threads(task))
>  		goto fail;

>  	/* If we haven't found our starting place yet start
>  	 * with the leader and walk nr threads forward.
>  	 */
> -	pos = leader;
> +	pos = task = task->group_leader;
>  	do {
>  		if (nr-- <= 0)
>  			goto found;
> -	} while_each_thread(leader, pos);
> +	} while_each_thread(task, pos);
>  fail:
>  	pos = NULL;
>  	goto out;
> @@ -3247,26 +3248,13 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
>  {
>  	struct dentry *dentry = filp->f_path.dentry;
>  	struct inode *inode = dentry->d_inode;
> -	struct task_struct *leader = NULL;
>  	struct task_struct *task;
> -	int retval = -ENOENT;
>  	ino_t ino;
>  	int tid;
>  	struct pid_namespace *ns;
>  
> -	task = get_proc_task(inode);
> -	if (!task)
> -		goto out_no_task;
> -	rcu_read_lock();
> -	if (pid_alive(task)) {
> -		leader = task->group_leader;
> -		get_task_struct(leader);
> -	}
> -	rcu_read_unlock();
> -	put_task_struct(task);
> -	if (!leader)
> -		goto out_no_task;
> -	retval = 0;


> +	if (!pid_task(proc_pid(inode), PIDTYPE_PID))
> +		return -ENOENT;

Strictly speaking this call to pid_task needs to be in a rcu critical
section.

>  	switch ((unsigned long)filp->f_pos) {
>  	case 0:
> @@ -3289,7 +3277,7 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
>  	ns = filp->f_dentry->d_sb->s_fs_info;
>  	tid = (int)filp->f_version;
>  	filp->f_version = 0;
> -	for (task = first_tid(leader, tid, filp->f_pos - 2, ns);
> +	for (task = first_tid(proc_pid(inode), tid, filp->f_pos - 2, ns);
>  	     task;
>  	     task = next_tid(task), filp->f_pos++) {
>  		tid = task_pid_nr_ns(task, ns);
> @@ -3302,9 +3290,7 @@ static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldi
>  		}
>  	}
>  out:
> -	put_task_struct(leader);
> -out_no_task:
> -	return retval;
> +	return 0;
>  }
>  
>  static int proc_task_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)


* Re: [PATCH 0/3] proc: first_tid() fix/cleanup
  2013-05-27 20:27 [PATCH 0/3] proc: first_tid() fix/cleanup Oleg Nesterov
                   ` (2 preceding siblings ...)
  2013-05-27 20:28 ` [PATCH 3/3] proc: simplify proc_task_readdir/first_tid paths Oleg Nesterov
@ 2013-05-29  5:22 ` Eric W. Biederman
  3 siblings, 0 replies; 13+ messages in thread
From: Eric W. Biederman @ 2013-05-29  5:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki, Michal Hocko,
	Sergey Dyasly, Sha Zhengju, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> Hello.
>
> next_thread() should be avoided, probably next_tid() is the
> only "valid" user.
>
> But now we have another reason to avoid (and probably even kill)
> it, we are going to replace or fix while_each_thread(), almost
> every lockless usage is wrong.
>
> I was going to send more changes, but this initial series nearly
> killed me. And I think first_tid() needs another cleanup, ->f_pos
> truncation doesn't look nice, tomorrow.

I have made some comments but overall this looks like a good set of
cleanups.

Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>

As for the f_pos truncation: if you want, you can safely check
whether f_pos is greater than PID_MAX_LIMIT, as we will never have
more threads than we have pids.

Eric


* Re: [PATCH 1/3] proc: first_tid: fix the potential use-after-free
  2013-05-29  4:08   ` Eric W. Biederman
@ 2013-05-29 12:30     ` Oleg Nesterov
  0 siblings, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2013-05-29 12:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki, Michal Hocko,
	Sergey Dyasly, Sha Zhengju, linux-kernel

On 05/28, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > proc_task_readdir() verifies that the result of get_proc_task()
> > is pid_alive() and thus its ->group_leader is fine too. However
> > this is not necessarily true after rcu_read_unlock(), we need
> > to recheck this after first_tid() does rcu_read_lock() again.
>
> I agree with you but you are missing something critical from your
> explanation.  If a process has been passed through __unhash_process
> then task->thread_group.next (aka next_thread) returns a pointer to the
> process that was its next thread in the thread group.  Importantly
> that pointer is only guaranteed to point to valid memory until the rcu
> grace period expires.

I tried to explain this below, in 1-4 steps... But OK, agreed, this
should be explained more clearly.

I'll update the changelog.

> > Note that we need 2. and 3. only because of get_nr_threads() check,
> > and this check was supposed to be optimization only.
>
> An optimization and denial of service attack prevention.  It keeps us
> from spinning for nearly unbounded amounts of time in the rcu critical
> section.

I do not really think we need this check to prevent DoS attacks.

The main loop does while_each_thread(), so it will stop after
nr_threads iterations. And a user can always do llseek to trigger
the "full" scan.

But this is off-topic, and

> But I agree it should not be needed from this part of
> correctness.

Yes.

> >
> > -	/* If nr exceeds the number of threads there is nothing todo */
> >  	pos = NULL;
> > +	/* If nr exceeds the number of threads there is nothing todo */
>
> Moving the comment is just noise and makes for confusing reading of your
> patch.

Well, I think this makes the code look a bit better. Without this change
the code will be

        /* If nr exceeds the number of threads there is nothing todo */
        pos = NULL;
        if (nr && nr >= get_nr_threads(leader))
                goto out;
        /* It could be unhashed before we take rcu lock */
        if (!pid_alive(leader))
                goto out;

and the comments explaining the checks are not "symmetrical". But I won't
argue, I'll update the patch and remove it. 3/3 changes this code anyway.

Oleg.



* Re: [PATCH 3/3] proc: simplify proc_task_readdir/first_tid paths
  2013-05-29  4:42   ` Eric W. Biederman
@ 2013-05-29 13:39     ` Oleg Nesterov
  2013-05-29 20:38       ` Eric W. Biederman
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2013-05-29 13:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki, Michal Hocko,
	Sergey Dyasly, Sha Zhengju, linux-kernel

On 05/28, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > proc_task_readdir() does not really need "leader", first_tid()
> > has to revalidate it anyway. Just pass proc_pid(inode) to
> > first_tid() instead, it can do pid_task(PIDTYPE_PID) itself
> > and read ->group_leader only if necessary.
> >
> > Note: I am not sure proc_task_readdir() really needs the initial
> > -ENOENT check, but this is what the current code does.
>
> This looks like a nice cleanup.
>
> We would need either -ENOENT or a return of 0 and an empty directory at
> the least.  We need the check so that empty directories don't have "."
> and ".." entries.

And this is not clear to me...

Why is the empty "." + ".." dir bad if the task(s) has gone away after
opendir?

> >  	if (tid && (nr > 0)) {
> >  		pos = find_task_by_pid_ns(tid, ns);
> > -		if (pos && (pos->group_leader == leader))
> > +		if (pos && same_thread_group(pos, task))
>
> Sigh this reminds me we need to figure out how to kill task->pid and
> task->tgid,

Yeah.

> which I assume means fixing same_thread_group.

Now that ->signal can't go away before task_struct, we can make it

	static inline
	int same_thread_group(struct task_struct *p1, struct task_struct *p2)
	{
		return p1->signal == p2->signal;
	}



> > +	if (!pid_task(proc_pid(inode), PIDTYPE_PID))
> > +		return -ENOENT;
>
> Strictly speaking this call to pid_task needs to be in a rcu critical
> section.

Argh, thanks.

we do not really need rcu, we are not going to dereference this pointer,
but we should make __rcu_dereference_check() happy...

I'll change this... but once again, can't we simply remove this check?






While you are here: could you explain the ->d_inode check in
proc_fill_cache()? The code _looks_ wrong,

	if (!child || IS_ERR(child) || !child->d_inode)
		goto end_instantiate;

If d_inode == NULL, who does dput() ?

OTOH, if we ensure d_inode != NULL, why do we check "if (inode)" after
inode = child->d_inode ?

IOW, it seems that this check should be simply removed?

Oleg.



* Re: [PATCH 3/3] proc: simplify proc_task_readdir/first_tid paths
  2013-05-29 13:39     ` Oleg Nesterov
@ 2013-05-29 20:38       ` Eric W. Biederman
  2013-05-31 16:38         ` Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Eric W. Biederman @ 2013-05-29 20:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki, Michal Hocko,
	Sergey Dyasly, Sha Zhengju, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> On 05/28, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > proc_task_readdir() does not really need "leader", first_tid()
>> > has to revalidate it anyway. Just pass proc_pid(inode) to
>> > first_tid() instead, it can do pid_task(PIDTYPE_PID) itself
>> > and read ->group_leader only if necessary.
>> >
>> > Note: I am not sure proc_task_readdir() really needs the initial
>> > -ENOENT check, but this is what the current code does.
>>
>> This looks like a nice cleanup.
>>
>> We would need either -ENOENT or a return of 0 and an empty directory at
>> the least.  We need the check so that empty directories don't have "."
>> and ".." entries.
>
> And this is not clear to me...
>
> Why is the empty "." + ".." dir bad if the task(s) has gone away after
> opendir?

Because the defining property of a deleted directory that you are in is
that getdents will return -ENOENT.

You can reproduce this with any linux filesystem.
mkdir foo
cd foo
rmdir ../foo
strace -f ls .

   open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
   getdents(3, 0x1851c88, 32768)           = -1 ENOENT (No such file or directory)
   close(3)                                = 0

Proc is no different in this regard, and the task could have gone away
before opendir if the process made the task directory its current
working directory before opening the file.

>> >  	if (tid && (nr > 0)) {
>> >  		pos = find_task_by_pid_ns(tid, ns);
>> > -		if (pos && (pos->group_leader == leader))
>> > +		if (pos && same_thread_group(pos, task))
>>
>> Sigh this reminds me we need to figure out how to kill task->pid and
>> task->tgid,
>
> Yeah.
>
>> which I assume means fixing same_thread_group.
>
> Now that ->signal can't go away before task_struct, we can make it
>
> 	static inline
> 	int same_thread_group(struct task_struct *p1, struct task_struct *p2)
> 	{
> 		return p1->signal == p2->signal;
> 	}

Oh cool.  I will review that patch in just a moment.

>> > +	if (!pid_task(proc_pid(inode), PIDTYPE_PID))
>> > +		return -ENOENT;
>>
>> Strictly speaking this call to pid_task needs to be in a rcu critical
>> section.
>
> Argh, thanks.
>
> we do not really need rcu, we are not going to dereference this pointer,
> but we should make __rcu_dereference_check() happy...
>
> I'll change this... but once again, can't we simply remove this check?

I can understand the desire but I think we need something to see if the
task for the directory has at least a zombie around.

> While you are here. Could you explain the ->d_inode check in
> proc_fill_cache() ? The code _looks_ wrong,
>
> 	if (!child || IS_ERR(child) || !child->d_inode)
> 		goto end_instantiate;
>
> If d_inode == NULL, who does dput() ?
>
> OTOH, if we ensure d_inode != NULL, why do we check "if (inode)" after
> inode = child->d_inode ?
>
> IOW, it seems that this check should be simply removed?

I seem to agree as I have a patch removing it on my development branch.
I think I wrote that before realizing that negative dentries don't
happen in proc.

Eric


* Re: [PATCH 3/3] proc: simplify proc_task_readdir/first_tid paths
  2013-05-29 20:38       ` Eric W. Biederman
@ 2013-05-31 16:38         ` Oleg Nesterov
  2013-05-31 18:12           ` Eric W. Biederman
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2013-05-31 16:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki, Michal Hocko,
	Sergey Dyasly, Sha Zhengju, linux-kernel

Eric, sorry for delay.

On 05/29, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > Why is the empty "." + ".." dir bad if the task(s) has gone away after
> > opendir?
>
> Because the definition of a deleted directory that you are in is that
> getdents will return -ENOENT.
>
> You can reproduce this with any linux filesystem.
> mkdir foo
> cd foo
> rmdir ../foo
> strace -f ls .
>
>    open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
>    getdents(3, 0x1851c88, 32768)           = -1 ENOENT (No such file or directory)
>    close(3)                                = 0

Heh. Indeed, vfs_readdir() checks IS_DEADDIR().

Thanks.

OK. But this means that even 1/3 is not 100% right, exactly because
leader can be unhashed right before first_tid() takes rcu lock. Easy
to fix, we should simply factor out the "nr != 0" check.

And this also means that 3/3 is not right by the same reason. I'll
make a simpler patch which only avoids the unnecessary get/put in
proc_task_readdir().

Unless we can tolerate this very unlikely race when the leader goes
away after the initial ENOENT check at the start, of course... Or unless
we add canceldir() which resets getdents_callback->previous so that
we could return ENOENT after filldir() was already called ;)

Thanks Eric. I'll send v2.

Oleg.



* Re: [PATCH 3/3] proc: simplify proc_task_readdir/first_tid paths
  2013-05-31 16:38         ` Oleg Nesterov
@ 2013-05-31 18:12           ` Eric W. Biederman
  2013-05-31 18:34             ` Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Eric W. Biederman @ 2013-05-31 18:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki, Michal Hocko,
	Sergey Dyasly, Sha Zhengju, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> Eric, sorry for delay.
>
> On 05/29, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > Why the empty "." + ".." dir is bad if the task(s) has gone away after
>> > opendir?
>>
>> Because the definition of a deleted directory that you are in is that
>> getdents will return -ENOENT.
>>
>> You can reproduce this with any linux filesystem.
>> mkdir foo
>> cd foo
>> rmdir ../foo
>> strace -f ls .
>>
>>    open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
>>    getdents(3, 0x1851c88, 32768)           = -1 ENOENT (No such file or directory)
>>    close(3)                                = 0
>
> Heh. Indeed, vfs_readdir() checks IS_DEADDIR().
>
> Thanks.
>
> OK. But this means that even 1/3 is not 100% right, exactly because
> leader can be unhashed right before first_tid() takes rcu lock. Easy
> to fix, we should simply factor out the "nr != 0" check.
>
> And this also means that 3/3 is not right by the same reason. I'll
> make a simpler patch which only avoids the unnecessary get/put in
> proc_task_readdir().
>
> > Unless we can tolerate this very unlikely race when the leader goes
> away after initial ENOENT check at the start, of course... Or unless
> we add canceldir() which resets getdents_callback->previous so that
> we could return ENOENT after filldir() was already called ;)

A small race is fine and is fundamental to the process of readdir.

The guarantee of open+readdir+close is that all directory entries that
existed before open and after close are returned.  Directory entries that
are added or removed during the open+readdir+close are returned at most
once.

The important case to handle is when someone has opened the directory a
very long time ago or has chdir'd to the directory, with the result that
the directory was removed before we start the readdir process entirely.

If the tasks die in the narrow window while we are inside of readdir
races are impossible to avoid.

Eric



* Re: [PATCH 3/3] proc: simplify proc_task_readdir/first_tid paths
  2013-05-31 18:12           ` Eric W. Biederman
@ 2013-05-31 18:34             ` Oleg Nesterov
  0 siblings, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2013-05-31 18:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki, Michal Hocko,
	Sergey Dyasly, Sha Zhengju, linux-kernel

On 05/31, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > OK. But this means that even 1/3 is not 100% right, exactly because
> > leader can be unhashed right before first_tid() takes rcu lock. Easy
> > to fix, we should simply factor out the "nr != 0" check.
> >
> > And this also means that 3/3 is not right by the same reason. I'll
> > make a simpler patch which only avoids the unnecessary get/put in
> > proc_task_readdir().
> >
> > Unless we can tolerate this very unlikely rase when the leader goes
> > away after initial ENOENT check at the start, of course... Or unless
> > we add canceldir() which resets getdents_callback->previous so that
> > we could return ENOENT after filldir() was already called ;)
>
> A small race is fine and is fundamental to the process of readdir.
>
> The guarantee of open+readdir+close is that all directory entries that
> existed before open and after close are returned.  Directory entries that
> are added or removed during the open+readdir+close are returned at most
> once.
>
> The important case to handle is when someone has opened the directory a
> very long time ago or has chdir'd to the directory.  With the result
> the directory was removed before we start the readdir process entirely.
>
> If the tasks die in the narrow window while we are inside of readdir
> races are impossible to avoid.

Ah OK. This means that v2 only needs the simple update.

Thanks again Eric.

Oleg.


