linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials
@ 2010-07-29 11:45 David Howells
  2010-07-29 11:45 ` [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment David Howells
  0 siblings, 1 reply; 19+ messages in thread
From: David Howells @ 2010-07-29 11:45 UTC (permalink / raw)
  To: torvalds, akpm
  Cc: linux-kernel, linux-security-module, David Howells, Jiri Olsa

It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
credentials by incrementing their usage count after their replacement by the
task being accessed.

What happens is that get_task_cred() can race with commit_creds():

	TASK_1			TASK_2			RCU_CLEANER
	-->get_task_cred(TASK_2)
	rcu_read_lock()
	__cred = __task_cred(TASK_2)
				-->commit_creds()
				old_cred = TASK_2->real_cred
				TASK_2->real_cred = ...
				put_cred(old_cred)
				  call_rcu(old_cred)
		[__cred->usage == 0]
	get_cred(__cred)
		[__cred->usage == 1]
	rcu_read_unlock()
							-->put_cred_rcu()
							[__cred->usage == 1]
							panic()

However, since a tasks credentials are generally not changed very often, we can
reasonably make use of a loop involving reading the creds pointer and using
atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero.

If successful, we can safely return the credentials in the knowledge that, even
if the task we're accessing has released them, they haven't gone to the RCU
cleanup code.

We then change task_state() in procfs to use get_task_cred() rather than
calling get_cred() on the result of __task_cred(), as that suffers from the
same problem.


Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
tripped when it is noticed that the usage count is not zero as it ought to be,
for example:

kernel BUG at kernel/cred.c:168!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 0
Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
745
RIP: 0010:[<ffffffff81069881>]  [<ffffffff81069881>] __put_cred+0xc/0x45
RSP: 0018:ffff88019e7e9eb8  EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
FS:  00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
Stack:
 ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
<0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
<0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
Call Trace:
 [<ffffffff810698cd>] put_cred+0x13/0x15
 [<ffffffff81069b45>] commit_creds+0x16b/0x175
 [<ffffffff8106aace>] set_current_groups+0x47/0x4e
 [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105
 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b
04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
RIP  [<ffffffff81069881>] __put_cred+0xc/0x45
 RSP <ffff88019e7e9eb8>
---[ end trace df391256a100ebdd ]---

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
---

 fs/proc/array.c      |    2 +-
 include/linux/cred.h |   21 +--------------------
 kernel/cred.c        |   25 +++++++++++++++++++++++++
 3 files changed, 27 insertions(+), 21 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 9b58d38..fff6572 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -176,7 +176,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		if (tracer)
 			tpid = task_pid_nr_ns(tracer, ns);
 	}
-	cred = get_cred((struct cred *) __task_cred(p));
+	cred = get_task_cred(p);
 	seq_printf(m,
 		"State:\t%s\n"
 		"Tgid:\t%d\n"
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 75c0fa8..ce40cbc 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -153,6 +153,7 @@ struct cred {
 extern void __put_cred(struct cred *);
 extern void exit_creds(struct task_struct *);
 extern int copy_creds(struct task_struct *, unsigned long);
+extern const struct cred *get_task_cred(struct task_struct *);
 extern struct cred *cred_alloc_blank(void);
 extern struct cred *prepare_creds(void);
 extern struct cred *prepare_exec_creds(void);
@@ -282,26 +283,6 @@ static inline void put_cred(const struct cred *_cred)
 	((const struct cred *)(rcu_dereference_check((task)->real_cred, rcu_read_lock_held() || lockdep_tasklist_lock_is_held())))
 
 /**
- * get_task_cred - Get another task's objective credentials
- * @task: The task to query
- *
- * Get the objective credentials of a task, pinning them so that they can't go
- * away.  Accessing a task's credentials directly is not permitted.
- *
- * The caller must make sure task doesn't go away, either by holding a ref on
- * task or by holding tasklist_lock to prevent it from being unlinked.
- */
-#define get_task_cred(task)				\
-({							\
-	struct cred *__cred;				\
-	rcu_read_lock();				\
-	__cred = (struct cred *) __task_cred((task));	\
-	get_cred(__cred);				\
-	rcu_read_unlock();				\
-	__cred;						\
-})
-
-/**
  * get_current_cred - Get the current task's subjective credentials
  *
  * Get the subjective credentials of the current task, pinning them so that
diff --git a/kernel/cred.c b/kernel/cred.c
index a2d5504..60bc8b1 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -209,6 +209,31 @@ void exit_creds(struct task_struct *tsk)
 	}
 }
 
+/**
+ * get_task_cred - Get another task's objective credentials
+ * @task: The task to query
+ *
+ * Get the objective credentials of a task, pinning them so that they can't go
+ * away.  Accessing a task's credentials directly is not permitted.
+ *
+ * The caller must also make sure task doesn't get deleted, either by holding a
+ * ref on task or by holding tasklist_lock to prevent it from being unlinked.
+ */
+const struct cred *get_task_cred(struct task_struct *task)
+{
+	const struct cred *cred;
+
+	rcu_read_lock();
+
+	do {
+		cred = __task_cred((task));
+		BUG_ON(!cred);
+	} while (!atomic_inc_not_zero(&((struct cred *)cred)->usage));
+
+	rcu_read_unlock();
+	return cred;
+}
+
 /*
  * Allocate blank credentials, such that the credentials can be filled in at a
  * later date without risk of ENOMEM.


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment
  2010-07-29 11:45 [PATCH 1/2] CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials David Howells
@ 2010-07-29 11:45 ` David Howells
  2010-08-02 20:40   ` Paul E. McKenney
  0 siblings, 1 reply; 19+ messages in thread
From: David Howells @ 2010-07-29 11:45 UTC (permalink / raw)
  To: torvalds, akpm
  Cc: linux-kernel, linux-security-module, David Howells, Jiri Olsa,
	Paul E. McKenney

Fix __task_cred()'s lockdep check by removing the following validation
condition:

	lockdep_tasklist_lock_is_held()

as commit_creds() does not take the tasklist_lock, and nor do most of the
functions that call it, so this check is pointless and it can prevent
detection of the RCU lock not being held if the tasklist_lock is held.

Instead, add the following validation condition:

	task->exit_state >= 0

to permit the access if the target task is dead and therefore unable to change
its own credentials.


Fix __task_cred()'s comment to:

 (1) discard the bit that says that the caller must prevent the target task
     from being deleted.  That shouldn't need saying.

 (2) Add a comment indicating the result of __task_cred() should not be passed
     directly to get_cred(), but rather than get_task_cred() should be used
     instead.

Also put a note into the documentation to enforce this point there too.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 Documentation/credentials.txt |    3 +++
 include/linux/cred.h          |   15 ++++++++++-----
 include/linux/sched.h         |    1 +
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/Documentation/credentials.txt b/Documentation/credentials.txt
index a2db352..995baf3 100644
--- a/Documentation/credentials.txt
+++ b/Documentation/credentials.txt
@@ -417,6 +417,9 @@ reference on them using:
 This does all the RCU magic inside of it.  The caller must call put_cred() on
 the credentials so obtained when they're finished with.
 
+ [*] Note: The result of __task_cred() should not be passed directly to
+     get_cred() as this may race with commit_cred().
+
 There are a couple of convenience functions to access bits of another task's
 credentials, hiding the RCU magic from the caller:
 
diff --git a/include/linux/cred.h b/include/linux/cred.h
index ce40cbc..4d2c395 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -274,13 +274,18 @@ static inline void put_cred(const struct cred *_cred)
  * @task: The task to query
  *
  * Access the objective credentials of a task.  The caller must hold the RCU
- * readlock.
+ * readlock or the task must be dead and unable to change its own credentials.
  *
- * The caller must make sure task doesn't go away, either by holding a ref on
- * task or by holding tasklist_lock to prevent it from being unlinked.
+ * The result of this function should not be passed directly to get_cred();
+ * rather get_task_cred() should be used instead.
  */
-#define __task_cred(task) \
-	((const struct cred *)(rcu_dereference_check((task)->real_cred, rcu_read_lock_held() || lockdep_tasklist_lock_is_held())))
+#define __task_cred(task)						\
+	({								\
+		const struct task_struct *__t = (task);			\
+		rcu_dereference_check(__t->real_cred,			\
+				      rcu_read_lock_held() ||		\
+				      task_is_dead(__t));		\
+	})
 
 /**
  * get_current_cred - Get the current task's subjective credentials
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 747fcae..0478888 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -214,6 +214,7 @@ extern char ___assert_task_state[1 - 2*!!(
 
 #define task_is_traced(task)	((task->state & __TASK_TRACED) != 0)
 #define task_is_stopped(task)	((task->state & __TASK_STOPPED) != 0)
+#define task_is_dead(task)	((task)->exit_state != 0)
 #define task_is_stopped_or_traced(task)	\
 			((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
 #define task_contributes_to_load(task)	\


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment
  2010-07-29 11:45 ` [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment David Howells
@ 2010-08-02 20:40   ` Paul E. McKenney
  2010-08-03  0:55     ` Tetsuo Handa
  2010-08-03  9:34     ` David Howells
  0 siblings, 2 replies; 19+ messages in thread
From: Paul E. McKenney @ 2010-08-02 20:40 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, akpm, linux-kernel, linux-security-module, Jiri Olsa

On Thu, Jul 29, 2010 at 12:45:55PM +0100, David Howells wrote:
> Fix __task_cred()'s lockdep check by removing the following validation
> condition:
> 
> 	lockdep_tasklist_lock_is_held()
> 
> as commit_creds() does not take the tasklist_lock, and nor do most of the
> functions that call it, so this check is pointless and it can prevent
> detection of the RCU lock not being held if the tasklist_lock is held.
> 
> Instead, add the following validation condition:
> 
> 	task->exit_state >= 0
> 
> to permit the access if the target task is dead and therefore unable to change
> its own credentials.
> 
> 
> Fix __task_cred()'s comment to:
> 
>  (1) discard the bit that says that the caller must prevent the target task
>      from being deleted.  That shouldn't need saying.
> 
>  (2) Add a comment indicating the result of __task_cred() should not be passed
>      directly to get_cred(), but rather than get_task_cred() should be used
>      instead.
> 
> Also put a note into the documentation to enforce this point there too.

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: David Howells <dhowells@redhat.com>
> Acked-by: Jiri Olsa <jolsa@redhat.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
> 
>  Documentation/credentials.txt |    3 +++
>  include/linux/cred.h          |   15 ++++++++++-----
>  include/linux/sched.h         |    1 +
>  3 files changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/credentials.txt b/Documentation/credentials.txt
> index a2db352..995baf3 100644
> --- a/Documentation/credentials.txt
> +++ b/Documentation/credentials.txt
> @@ -417,6 +417,9 @@ reference on them using:
>  This does all the RCU magic inside of it.  The caller must call put_cred() on
>  the credentials so obtained when they're finished with.
> 
> + [*] Note: The result of __task_cred() should not be passed directly to
> +     get_cred() as this may race with commit_cred().
> +
>  There are a couple of convenience functions to access bits of another task's
>  credentials, hiding the RCU magic from the caller:
> 
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index ce40cbc..4d2c395 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -274,13 +274,18 @@ static inline void put_cred(const struct cred *_cred)
>   * @task: The task to query
>   *
>   * Access the objective credentials of a task.  The caller must hold the RCU
> - * readlock.
> + * readlock or the task must be dead and unable to change its own credentials.
>   *
> - * The caller must make sure task doesn't go away, either by holding a ref on
> - * task or by holding tasklist_lock to prevent it from being unlinked.
> + * The result of this function should not be passed directly to get_cred();
> + * rather get_task_cred() should be used instead.
>   */
> -#define __task_cred(task) \
> -	((const struct cred *)(rcu_dereference_check((task)->real_cred, rcu_read_lock_held() || lockdep_tasklist_lock_is_held())))
> +#define __task_cred(task)						\
> +	({								\
> +		const struct task_struct *__t = (task);			\
> +		rcu_dereference_check(__t->real_cred,			\
> +				      rcu_read_lock_held() ||		\
> +				      task_is_dead(__t));		\
> +	})
> 
>  /**
>   * get_current_cred - Get the current task's subjective credentials
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 747fcae..0478888 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -214,6 +214,7 @@ extern char ___assert_task_state[1 - 2*!!(
> 
>  #define task_is_traced(task)	((task->state & __TASK_TRACED) != 0)
>  #define task_is_stopped(task)	((task->state & __TASK_STOPPED) != 0)
> +#define task_is_dead(task)	((task)->exit_state != 0)
>  #define task_is_stopped_or_traced(task)	\
>  			((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
>  #define task_contributes_to_load(task)	\
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment
  2010-08-02 20:40   ` Paul E. McKenney
@ 2010-08-03  0:55     ` Tetsuo Handa
  2010-08-03  9:34     ` David Howells
  1 sibling, 0 replies; 19+ messages in thread
From: Tetsuo Handa @ 2010-08-03  0:55 UTC (permalink / raw)
  To: paulmck
  Cc: torvalds, akpm, linux-kernel, linux-security-module, Jiri Olsa,
	David Howells

I got below warning. Is this related to this patch?

[  140.173556] ===================================================
[  140.215379] [ INFO: suspicious rcu_dereference_check() usage. ]
[  140.216461] ---------------------------------------------------
[  140.217530] kernel/signal.c:660 invoked rcu_dereference_check() without protection!
[  140.218937] 
[  140.218938] other info that might help us debug this:
[  140.218939] 
[  140.220508] 
[  140.220509] rcu_scheduler_active = 1, debug_locks = 1
[  140.221991] 1 lock held by init/1:
[  140.222668]  #0:  (tasklist_lock){.+.+..}, at: [<c104a0ac>] kill_something_info+0x7c/0x160
[  140.224709] 
[  140.224711] stack backtrace:
[  140.225661] Pid: 1, comm: init Not tainted 2.6.35 #1
[  140.226576] Call Trace:
[  140.227111]  [<c103cca8>] ? printk+0x18/0x20
[  140.227908]  [<c1069884>] lockdep_rcu_dereference+0x94/0xb0
[  140.228931]  [<c104936a>] check_kill_permission+0x15a/0x170
[  140.229932]  [<c104a0ac>] ? kill_something_info+0x7c/0x160
[  140.230921]  [<c1049cca>] group_send_sig_info+0x1a/0x50
[  140.231866]  [<c1049d36>] __kill_pgrp_info+0x36/0x60
[  140.232780]  [<c104a0d0>] kill_something_info+0xa0/0x160
[  140.233740]  [<c10831c5>] ? __call_rcu+0xa5/0x110
[  140.234596]  [<c104b7ee>] sys_kill+0x5e/0x70
[  140.235387]  [<c10d1eee>] ? mntput_no_expire+0x1e/0xa0
[  140.236329]  [<c10bbd10>] ? __fput+0x170/0x220
[  140.257756]  [<c10bbdd9>] ? fput+0x19/0x20
[  140.258529]  [<c137ad94>] ? restore_all_notrace+0x0/0x18
[  140.259496]  [<c11bfb04>] ? trace_hardirqs_on_thunk+0xc/0x10
[  140.260531]  [<c137ad61>] syscall_call+0x7/0xb
[  144.627841] nfsd: last server has exited, flushing export cache
[  154.040420] Restarting system.
[  154.041061] machine restart

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment
  2010-08-02 20:40   ` Paul E. McKenney
  2010-08-03  0:55     ` Tetsuo Handa
@ 2010-08-03  9:34     ` David Howells
  2010-08-03 16:07       ` Linus Torvalds
  2010-08-04  0:38       ` Tetsuo Handa
  1 sibling, 2 replies; 19+ messages in thread
From: David Howells @ 2010-08-03  9:34 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: dhowells, paulmck, torvalds, akpm, linux-kernel,
	linux-security-module, Jiri Olsa

Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:

> I got below warning. Is this related to this patch?
> 
> [  140.173556] ===================================================
> [  140.215379] [ INFO: suspicious rcu_dereference_check() usage. ]
> [  140.216461] ---------------------------------------------------
> [  140.217530] kernel/signal.c:660 invoked rcu_dereference_check() without protection!

Yes.  The patch has uncovered a case of where we should be holding a lock, but
aren't.

Can you try the attached patch?

David
---
From: David Howells <dhowells@redhat.com>
Subject: [PATCH] CRED: Fix RCU warning due to previous patch fixing __task_cred()'s checks

A previous patch:

	commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
	Author: David Howells <dhowells@redhat.com>
	Date:   Thu Jul 29 12:45:55 2010 +0100
	Subject: CRED: Fix __task_cred()'s lockdep check and banner comment

fixed the lockdep checks on __task_cred().  This has shown up a place in the
signalling code where a lock should be held - namely that
check_kill_permission() requires its callers to hold the RCU lock.

It's may be that it would be better to add RCU read lock calls in
group_send_sig_info() only, around the call to check_kill_permission().  On the
other hand, some of the callers are either holding the RCU read lock already,
or have disabled interrupts, in which case, it's just extra overhead to do it
in g_s_s_i().

Without this patch, the following warning can occur:

[  140.173556] ===================================================
[  140.215379] [ INFO: suspicious rcu_dereference_check() usage. ]
[  140.216461] ---------------------------------------------------
[  140.217530] kernel/signal.c:660 invoked rcu_dereference_check() without protection!
[  140.218937] 
[  140.218938] other info that might help us debug this:
[  140.218939] 
[  140.220508] 
[  140.220509] rcu_scheduler_active = 1, debug_locks = 1
[  140.221991] 1 lock held by init/1:
[  140.222668]  #0:  (tasklist_lock){.+.+..}, at: [<c104a0ac>] kill_something_info+0x7c/0x160
[  140.224709] 
[  140.224711] stack backtrace:
[  140.225661] Pid: 1, comm: init Not tainted 2.6.35 #1
[  140.226576] Call Trace:
[  140.227111]  [<c103cca8>] ? printk+0x18/0x20
[  140.227908]  [<c1069884>] lockdep_rcu_dereference+0x94/0xb0
[  140.228931]  [<c104936a>] check_kill_permission+0x15a/0x170
[  140.229932]  [<c104a0ac>] ? kill_something_info+0x7c/0x160
[  140.230921]  [<c1049cca>] group_send_sig_info+0x1a/0x50
[  140.231866]  [<c1049d36>] __kill_pgrp_info+0x36/0x60
[  140.232780]  [<c104a0d0>] kill_something_info+0xa0/0x160
[  140.233740]  [<c10831c5>] ? __call_rcu+0xa5/0x110
[  140.234596]  [<c104b7ee>] sys_kill+0x5e/0x70
[  140.235387]  [<c10d1eee>] ? mntput_no_expire+0x1e/0xa0
[  140.236329]  [<c10bbd10>] ? __fput+0x170/0x220
[  140.257756]  [<c10bbdd9>] ? fput+0x19/0x20
[  140.258529]  [<c137ad94>] ? restore_all_notrace+0x0/0x18
[  140.259496]  [<c11bfb04>] ? trace_hardirqs_on_thunk+0xc/0x10
[  140.260531]  [<c137ad61>] syscall_call+0x7/0xb
[  144.627841] nfsd: last server has exited, flushing export cache
[  154.040420] Restarting system.
[  154.041061] machine restart

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 kernel/exit.c   |    2 ++
 kernel/signal.c |    8 ++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)


diff --git a/kernel/exit.c b/kernel/exit.c
index ceffc67..7858ebf 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -773,6 +773,7 @@ static void forget_original_parent(struct task_struct *father)
 
 	exit_ptrace(father);
 
+	rcu_read_lock();
 	write_lock_irq(&tasklist_lock);
 	reaper = find_new_reaper(father);
 
@@ -791,6 +792,7 @@ static void forget_original_parent(struct task_struct *father)
 		reparent_leader(father, p, &dead_children);
 	}
 	write_unlock_irq(&tasklist_lock);
+	rcu_read_unlock();
 
 	BUG_ON(!list_empty(&father->children));
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 906ae5a..f771156 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -637,7 +637,7 @@ static inline bool si_fromuser(const struct siginfo *info)
 
 /*
  * Bad permissions for sending the signal
- * - the caller must hold at least the RCU read lock
+ * - the caller must hold the RCU read lock
  */
 static int check_kill_permission(int sig, struct siginfo *info,
 				 struct task_struct *t)
@@ -1127,7 +1127,7 @@ struct sighand_struct *lock_task_sighand(struct task_struct *tsk, unsigned long
 
 /*
  * send signal info to all the members of a group
- * - the caller must hold the RCU read lock at least
+ * - the caller must hold the RCU read lock
  */
 int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 {
@@ -1151,11 +1151,13 @@ int __kill_pgrp_info(int sig, struct siginfo *info, struct pid *pgrp)
 
 	success = 0;
 	retval = -ESRCH;
+	rcu_read_lock();
 	do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
 		int err = group_send_sig_info(sig, info, p);
 		success |= !err;
 		retval = err;
 	} while_each_pid_task(pgrp, PIDTYPE_PGID, p);
+	rcu_read_unlock();
 	return success ? 0 : retval;
 }
 
@@ -1261,6 +1263,7 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
 		int retval = 0, count = 0;
 		struct task_struct * p;
 
+		rcu_read_lock();
 		for_each_process(p) {
 			if (task_pid_vnr(p) > 1 &&
 					!same_thread_group(p, current)) {
@@ -1270,6 +1273,7 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
 					retval = err;
 			}
 		}
+		rcu_read_unlock();
 		ret = count ? retval : -ESRCH;
 	}
 	read_unlock(&tasklist_lock);


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner  comment
  2010-08-03  9:34     ` David Howells
@ 2010-08-03 16:07       ` Linus Torvalds
  2010-08-03 17:48         ` Thomas Gleixner
                           ` (3 more replies)
  2010-08-04  0:38       ` Tetsuo Handa
  1 sibling, 4 replies; 19+ messages in thread
From: Linus Torvalds @ 2010-08-03 16:07 UTC (permalink / raw)
  To: David Howells, Oleg Nesterov, Thomas Gleixner
  Cc: Tetsuo Handa, paulmck, akpm, linux-kernel, linux-security-module,
	Jiri Olsa

Added Oleg and Thomas to the participants.

Oleg/Thomas: the whole thread is on lkml, but I'm quoting most of the
relevant parts..

On Tue, Aug 3, 2010 at 2:34 AM, David Howells <dhowells@redhat.com> wrote:
>
> A previous patch:
>
>        commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
>        Author: David Howells <dhowells@redhat.com>
>        Date:   Thu Jul 29 12:45:55 2010 +0100
>        Subject: CRED: Fix __task_cred()'s lockdep check and banner comment
>
> fixed the lockdep checks on __task_cred().  This has shown up a place in the
> signalling code where a lock should be held - namely that
> check_kill_permission() requires its callers to hold the RCU lock.

It's not just check_kill_permission(), is it? I thought we could do
the "for_each_process()" loops with just RCU, rather than holding the
whole tasklist_lock? So I _think_ that getting the RCU read-lock would
make it possible to get rid of the tasklist_lock in there too? At
least in kill_something_info().

Yes/no? What am I missing? This is an Oleg question, mainly.

> It's may be that it would be better to add RCU read lock calls in
> group_send_sig_info() only, around the call to check_kill_permission().  On the
> other hand, some of the callers are either holding the RCU read lock already,
> or have disabled interrupts, in which case, it's just extra overhead to do it
> in g_s_s_i().
>
> Without this patch, the following warning can occur:
>
> [  140.173556] ===================================================
> [  140.215379] [ INFO: suspicious rcu_dereference_check() usage. ]
> [  140.216461] ---------------------------------------------------
> [  140.217530] kernel/signal.c:660 invoked rcu_dereference_check() without protection!
> [  140.218937]
> [  140.218938] other info that might help us debug this:
> [  140.218939]
> [  140.220508]
> [  140.220509] rcu_scheduler_active = 1, debug_locks = 1
> [  140.221991] 1 lock held by init/1:
> [  140.222668]  #0:  (tasklist_lock){.+.+..}, at: [<c104a0ac>] kill_something_info+0x7c/0x160
> [  140.224709]
> [  140.224711] stack backtrace:
> [  140.225661] Pid: 1, comm: init Not tainted 2.6.35 #1
> [  140.226576] Call Trace:
> [  140.227111]  [<c103cca8>] ? printk+0x18/0x20
> [  140.227908]  [<c1069884>] lockdep_rcu_dereference+0x94/0xb0
> [  140.228931]  [<c104936a>] check_kill_permission+0x15a/0x170
> [  140.229932]  [<c104a0ac>] ? kill_something_info+0x7c/0x160
> [  140.230921]  [<c1049cca>] group_send_sig_info+0x1a/0x50
> [  140.231866]  [<c1049d36>] __kill_pgrp_info+0x36/0x60
> [  140.232780]  [<c104a0d0>] kill_something_info+0xa0/0x160
> [  140.233740]  [<c10831c5>] ? __call_rcu+0xa5/0x110
> [  140.234596]  [<c104b7ee>] sys_kill+0x5e/0x70
> [  140.235387]  [<c10d1eee>] ? mntput_no_expire+0x1e/0xa0
> [  140.236329]  [<c10bbd10>] ? __fput+0x170/0x220
> [  140.257756]  [<c10bbdd9>] ? fput+0x19/0x20
> [  140.258529]  [<c137ad94>] ? restore_all_notrace+0x0/0x18
> [  140.259496]  [<c11bfb04>] ? trace_hardirqs_on_thunk+0xc/0x10
> [  140.260531]  [<c137ad61>] syscall_call+0x7/0xb
> [  144.627841] nfsd: last server has exited, flushing export cache
> [  154.040420] Restarting system.
> [  154.041061] machine restart
>
> Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
>  kernel/exit.c   |    2 ++
>  kernel/signal.c |    8 ++++++--
>  2 files changed, 8 insertions(+), 2 deletions(-)
>
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index ceffc67..7858ebf 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -773,6 +773,7 @@ static void forget_original_parent(struct task_struct *father)
>
>        exit_ptrace(father);
>
> +       rcu_read_lock();
>        write_lock_irq(&tasklist_lock);
>        reaper = find_new_reaper(father);
>
> @@ -791,6 +792,7 @@ static void forget_original_parent(struct task_struct *father)
>                reparent_leader(father, p, &dead_children);
>        }
>        write_unlock_irq(&tasklist_lock);
> +       rcu_read_unlock();
>
>        BUG_ON(!list_empty(&father->children));
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 906ae5a..f771156 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -637,7 +637,7 @@ static inline bool si_fromuser(const struct siginfo *info)
>
>  /*
>  * Bad permissions for sending the signal
> - * - the caller must hold at least the RCU read lock
> + * - the caller must hold the RCU read lock
>  */
>  static int check_kill_permission(int sig, struct siginfo *info,
>                                 struct task_struct *t)
> @@ -1127,7 +1127,7 @@ struct sighand_struct *lock_task_sighand(struct task_struct *tsk, unsigned long
>
>  /*
>  * send signal info to all the members of a group
> - * - the caller must hold the RCU read lock at least
> + * - the caller must hold the RCU read lock
>  */
>  int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
>  {
> @@ -1151,11 +1151,13 @@ int __kill_pgrp_info(int sig, struct siginfo *info, struct pid *pgrp)
>
>        success = 0;
>        retval = -ESRCH;
> +       rcu_read_lock();
>        do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
>                int err = group_send_sig_info(sig, info, p);
>                success |= !err;
>                retval = err;
>        } while_each_pid_task(pgrp, PIDTYPE_PGID, p);
> +       rcu_read_unlock();

Ok, makes sense. At the same time, I do wonder if we shouldn't perhaps
do this in the caller. The callers would seem to all want it - look at
kill_something_info(), for example, where it really looks like it
would be nicer to put that _whole_ function under rcu_read_lock() (in
the caller itself). Or in kernel/exit.c, we have two calls
back-to-back, so doing it in the caller would clean that up and do it
just once.

Look here:

> @@ -1261,6 +1263,7 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
>                int retval = 0, count = 0;
>                struct task_struct * p;
>
> +               rcu_read_lock();
>                for_each_process(p) {
>                        if (task_pid_vnr(p) > 1 &&
>                                        !same_thread_group(p, current)) {
> @@ -1270,6 +1273,7 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
>                                        retval = err;
>                        }
>                }
> +               rcu_read_unlock();
>                ret = count ? retval : -ESRCH;
>        }
>        read_unlock(&tasklist_lock);

with those fragments, we now end up having rcu_read_lock() for _all_
the three cases in kill_something_info(), but rather than do it once,
we do it explicitly, and we do it in two different ways (two cases do
it explicitly, the middle one does it implicitly when it calls
__kill_pgrp_info().

So I think the patch is correct, but at the same time I do get the
feeling that we should just do it slightly differently.

Comments?

            Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment
  2010-08-03 16:07       ` Linus Torvalds
@ 2010-08-03 17:48         ` Thomas Gleixner
  2010-08-04 13:17         ` Oleg Nesterov
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Thomas Gleixner @ 2010-08-03 17:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Oleg Nesterov, Tetsuo Handa, paulmck, akpm,
	linux-kernel, linux-security-module, Jiri Olsa

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1296 bytes --]

On Tue, 3 Aug 2010, Linus Torvalds wrote:

> Added Oleg and Thomas to the participants.
> 
> Oleg/Thomas: the whole thread is on lkml, but I'm quoting most of the
> relevant parts..
> 
> On Tue, Aug 3, 2010 at 2:34 AM, David Howells <dhowells@redhat.com> wrote:
> >
> > A previous patch:
> >
> >        commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
> >        Author: David Howells <dhowells@redhat.com>
> >        Date:   Thu Jul 29 12:45:55 2010 +0100
> >        Subject: CRED: Fix __task_cred()'s lockdep check and banner comment
> >
> > fixed the lockdep checks on __task_cred().  This has shown up a place in the
> > signalling code where a lock should be held - namely that
> > check_kill_permission() requires its callers to hold the RCU lock.
> 
> It's not just check_kill_permission(), is it? I thought we could do
> the "for_each_process()" loops with just RCU, rather than holding the
> whole tasklist_lock? So I _think_ that getting the RCU read-lock would
> make it possible to get rid of the tasklist_lock in there too? At
> least in kill_something_info().
> 
> Yes/no? What am I missing? This is an Oleg question, mainly.

Yes, almost all places in the kernel which hold tasklist_lock read
locked are safe with RCU. I started to work on that, but got
distracted.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment
  2010-08-03  9:34     ` David Howells
  2010-08-03 16:07       ` Linus Torvalds
@ 2010-08-04  0:38       ` Tetsuo Handa
  1 sibling, 0 replies; 19+ messages in thread
From: Tetsuo Handa @ 2010-08-04  0:38 UTC (permalink / raw)
  To: dhowells
  Cc: paulmck, torvalds, akpm, linux-kernel, linux-security-module, jolsa

David Howells wrote:
> Tetsuo Handa wrote:
> 
> > I got below warning. Is this related to this patch?
> > 
> > [  140.173556] ===================================================
> > [  140.215379] [ INFO: suspicious rcu_dereference_check() usage. ]
> > [  140.216461] ---------------------------------------------------
> > [  140.217530] kernel/signal.c:660 invoked rcu_dereference_check() without protection!
> 
> Yes.  The patch has uncovered a case of where we should be holding a lock, but
> aren't.
> 
> Can you try the attached patch?
> 
> David
> ---
> From: David Howells <dhowells@redhat.com>
> Subject: [PATCH] CRED: Fix RCU warning due to previous patch fixing __task_cred()'s checks
> 
> A previous patch:
> 
> 	commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
> 	Author: David Howells <dhowells@redhat.com>
> 	Date:   Thu Jul 29 12:45:55 2010 +0100
> 	Subject: CRED: Fix __task_cred()'s lockdep check and banner comment
> 
> fixed the lockdep checks on __task_cred().  This has shown up a place in the
> signalling code where a lock should be held - namely that
> check_kill_permission() requires its callers to hold the RCU lock.
> 
> It's may be that it would be better to add RCU read lock calls in
> group_send_sig_info() only, around the call to check_kill_permission().  On the
> other hand, some of the callers are either holding the RCU read lock already,
> or have disabled interrupts, in which case, it's just extra overhead to do it
> in g_s_s_i().

That patch solved the warning. Thank you.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment
  2010-08-03 16:07       ` Linus Torvalds
  2010-08-03 17:48         ` Thomas Gleixner
@ 2010-08-04 13:17         ` Oleg Nesterov
  2010-08-04 14:01         ` David Howells
  2010-08-05  7:19         ` Eric W. Biederman
  3 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2010-08-04 13:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Thomas Gleixner, Tetsuo Handa, paulmck, akpm,
	linux-kernel, linux-security-module, Jiri Olsa, Roland McGrath

On 08/03, Linus Torvalds wrote:
>
> On Tue, Aug 3, 2010 at 2:34 AM, David Howells <dhowells@redhat.com> wrote:
> >
> > A previous patch:
> >
> >        commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
> >        Author: David Howells <dhowells@redhat.com>
> >        Date:   Thu Jul 29 12:45:55 2010 +0100
> >        Subject: CRED: Fix __task_cred()'s lockdep check and banner comment

I am not sure I understand this patch.

__task_cred() checks rcu_read_lock_held() || task_is_dead(), and
task_is_dead(task) is ((task)->exit_state != 0).

OK, task_is_dead() is valid for, say, wait_task_zombie(). But
wait_task_stopped() calls __task_cred(p) without rcu lock and p is alive.
The code is correct, this thread can do nothing until we drop ->siglock.

> > fixed the lockdep checks on __task_cred().  This has shown up a place in the
> > signalling code where a lock should be held - namely that
> > check_kill_permission() requires its callers to hold the RCU lock.
>
> It's not just check_kill_permission(), is it? I thought we could do
> the "for_each_process()" loops with just RCU, rather than holding the
> whole tasklist_lock?

Yes, for_each_process() is rcu-safe by itself.

> So I _think_ that getting the RCU read-lock would
> make it possible to get rid of the tasklist_lock in there too? At
> least in kill_something_info().

As for kill_something_info(), I think yes. I even sent (iirc) the
protoptype patch a long ago. We can't just remove tasklist, we should
avoid the races fork/exit/exec in the kill(-1, SIG) case.

The same for kill_pgrp/__kill_pgrp_info(). We need tasklist to ensure
that nobody in this group can escape the signal. This seems solveable
too, it was even discussed a bit.

> > It's may be that it would be better to add RCU read lock calls in
> > group_send_sig_info() only, around the call to check_kill_permission().

I must admit, at first glance changing check_kill_permission() to take
rcu lock looks better to me.

> On the
> > other hand, some of the callers are either holding the RCU read lock already,
> > or have disabled interrupts,

Hmm. So, local_irq_disable() "officially" blocks rcu? It does in practice
(unless I missed the new version of RCU), but, say,  posix_timer_event()
takes rcu_read_lock() exactly because I thought we shouldn't assume that
irqs_disabled() acts as rcu_read_lock() ?

There are other examples of rcu_read_lock() under local_irq_disable().

> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -773,6 +773,7 @@ static void forget_original_parent(struct task_struct *father)
> >
> >        exit_ptrace(father);
> >
> > +       rcu_read_lock();
> >        write_lock_irq(&tasklist_lock);
> >        reaper = find_new_reaper(father);

No, this doesn't look right. find_new_reaper() can drop tasklist and sleep.

Besides, this patch conflicts with the change in -mm tree. And imho this
looks a bit as "action at a distance".

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment
  2010-08-03 16:07       ` Linus Torvalds
  2010-08-03 17:48         ` Thomas Gleixner
  2010-08-04 13:17         ` Oleg Nesterov
@ 2010-08-04 14:01         ` David Howells
  2010-08-04 15:08           ` Oleg Nesterov
  2010-08-04 15:22           ` David Howells
  2010-08-05  7:19         ` Eric W. Biederman
  3 siblings, 2 replies; 19+ messages in thread
From: David Howells @ 2010-08-04 14:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: dhowells, Linus Torvalds, Thomas Gleixner, Tetsuo Handa, paulmck,
	akpm, linux-kernel, linux-security-module, Jiri Olsa,
	Roland McGrath

Oleg Nesterov <oleg@redhat.com> wrote:

> On 08/03, Linus Torvalds wrote:
> >
> > On Tue, Aug 3, 2010 at 2:34 AM, David Howells <dhowells@redhat.com> wrote:
> > >
> > > A previous patch:
> > >
> > >        commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
> > >        Author: David Howells <dhowells@redhat.com>
> > >        Date:   Thu Jul 29 12:45:55 2010 +0100
> > >        Subject: CRED: Fix __task_cred()'s lockdep check and banner comment
> 
> I am not sure I understand this patch.

You are talking about the 'previous patch'?

> __task_cred() checks rcu_read_lock_held() || task_is_dead(), and
> task_is_dead(task) is ((task)->exit_state != 0).
> 
> OK, task_is_dead() is valid for, say, wait_task_zombie(). But
> wait_task_stopped() calls __task_cred(p) without rcu lock and p is alive.
> The code is correct, this thread can do nothing until we drop ->siglock.

The problem is that we have to tell lockdep this.  Just checking in
__task_cred() that siglock is held is insufficient.  That doesn't handle, say,
sys_setuid() from changing the credentials, and effectively skips the check in
places where it mustn't.

Similarly, having interrupts disabled on the CPU we're running on doesn't help
either, since it doesn't stop another CPU replacing those credentials.

There are ways of dealing with wait_task_stopped():

 (1) Place an rcu_read_lock()'d section around the call to __task_cred().

 (2) Make __task_cred()'s lockdep understand about the target task being
     stopped whilst we hold its siglock.

 (3) Don't use __task_cred(), but rather dereference the pointer directly:

	rcu_dereference_protected(p->real_cred,
				  lock_is_held(&p->sighand->siglock))

     (Possibly wrapped in a macro in linux/cred.h).

> > > It's may be that it would be better to add RCU read lock calls in
> > > group_send_sig_info() only, around the call to check_kill_permission().
> 
> I must admit, at first glance changing check_kill_permission() to take
> rcu lock looks better to me.

I think group_send_sig_info() would be better.  The only other caller of
c_k_p() already has to hold the RCU read lock for other reasons.

How about the attached patch then?

> > > On the other hand, some of the callers are either holding the RCU read
> > > lock already, or have disabled interrupts,
> 
> Hmm. So, local_irq_disable() "officially" blocks rcu? It does in practice
> (unless I missed the new version of RCU), but, say,  posix_timer_event()
> takes rcu_read_lock() exactly because I thought we shouldn't assume that
> irqs_disabled() acts as rcu_read_lock() ?

This CPU can't be preempted if it can't be interrupted, I think.

David
---
From: David Howells <dhowells@redhat.com>
Subject: [PATCH] CRED: Fix RCU warning due to previous patch fixing __task_cred()'s checks

A previous patch:

	commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
	Author: David Howells <dhowells@redhat.com>
	Date:   Thu Jul 29 12:45:55 2010 +0100
	Subject: CRED: Fix __task_cred()'s lockdep check and banner comment

fixed the lockdep checks on __task_cred().  This has shown up a place in the
signalling code where a lock should be held - namely that
check_kill_permission() requires its callers to hold the RCU lock.

Fix group_send_sig_info() to get the RCU read lock around its call to
check_kill_permission().

Without this patch, the following warning can occur:

[  140.173556] ===================================================
[  140.215379] [ INFO: suspicious rcu_dereference_check() usage. ]
[  140.216461] ---------------------------------------------------
[  140.217530] kernel/signal.c:660 invoked rcu_dereference_check() without protection!
[  140.218937]
[  140.218938] other info that might help us debug this:
[  140.218939]
[  140.220508]
[  140.220509] rcu_scheduler_active = 1, debug_locks = 1
[  140.221991] 1 lock held by init/1:
[  140.222668]  #0:  (tasklist_lock){.+.+..}, at: [<c104a0ac>] kill_something_info+0x7c/0x160
[  140.224709]
[  140.224711] stack backtrace:
[  140.225661] Pid: 1, comm: init Not tainted 2.6.35 #1
[  140.226576] Call Trace:
[  140.227111]  [<c103cca8>] ? printk+0x18/0x20
[  140.227908]  [<c1069884>] lockdep_rcu_dereference+0x94/0xb0
[  140.228931]  [<c104936a>] check_kill_permission+0x15a/0x170
[  140.229932]  [<c104a0ac>] ? kill_something_info+0x7c/0x160
[  140.230921]  [<c1049cca>] group_send_sig_info+0x1a/0x50
[  140.231866]  [<c1049d36>] __kill_pgrp_info+0x36/0x60
[  140.232780]  [<c104a0d0>] kill_something_info+0xa0/0x160
[  140.233740]  [<c10831c5>] ? __call_rcu+0xa5/0x110
[  140.234596]  [<c104b7ee>] sys_kill+0x5e/0x70
[  140.235387]  [<c10d1eee>] ? mntput_no_expire+0x1e/0xa0
[  140.236329]  [<c10bbd10>] ? __fput+0x170/0x220
[  140.257756]  [<c10bbdd9>] ? fput+0x19/0x20
[  140.258529]  [<c137ad94>] ? restore_all_notrace+0x0/0x18
[  140.259496]  [<c11bfb04>] ? trace_hardirqs_on_thunk+0xc/0x10
[  140.260531]  [<c137ad61>] syscall_call+0x7/0xb
[  144.627841] nfsd: last server has exited, flushing export cache
[  154.040420] Restarting system.
[  154.041061] machine restart

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 kernel/signal.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)


diff --git a/kernel/signal.c b/kernel/signal.c
index 906ae5a..bded651 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -637,7 +637,7 @@ static inline bool si_fromuser(const struct siginfo *info)
 
 /*
  * Bad permissions for sending the signal
- * - the caller must hold at least the RCU read lock
+ * - the caller must hold the RCU read lock
  */
 static int check_kill_permission(int sig, struct siginfo *info,
 				 struct task_struct *t)
@@ -1127,11 +1127,14 @@ struct sighand_struct *lock_task_sighand(struct task_struct *tsk, unsigned long
 
 /*
  * send signal info to all the members of a group
- * - the caller must hold the RCU read lock at least
  */
 int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 {
-	int ret = check_kill_permission(sig, info, p);
+	int ret;
+
+	rcu_read_lock();
+	ret = check_kill_permission(sig, info, p);
+	rcu_read_unlock();
 
 	if (!ret && sig)
 		ret = do_send_sig_info(sig, info, p, true);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment
  2010-08-04 14:01         ` David Howells
@ 2010-08-04 15:08           ` Oleg Nesterov
  2010-08-04 15:22           ` David Howells
  1 sibling, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2010-08-04 15:08 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Thomas Gleixner, Tetsuo Handa, paulmck, akpm,
	linux-kernel, linux-security-module, Jiri Olsa, Roland McGrath

On 08/04, David Howells wrote:
>
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 08/03, Linus Torvalds wrote:
> > >
> > > On Tue, Aug 3, 2010 at 2:34 AM, David Howells <dhowells@redhat.com> wrote:
> > > >
> > > > A previous patch:
> > > >
> > > >        commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
> > > >        Author: David Howells <dhowells@redhat.com>
> > > >        Date:   Thu Jul 29 12:45:55 2010 +0100
> > > >        Subject: CRED: Fix __task_cred()'s lockdep check and banner comment
> >
> > I am not sure I understand this patch.
>
> You are talking about the 'previous patch'?
>
> > __task_cred() checks rcu_read_lock_held() || task_is_dead(), and
> > task_is_dead(task) is ((task)->exit_state != 0).
> >
> > OK, task_is_dead() is valid for, say, wait_task_zombie(). But
> > wait_task_stopped() calls __task_cred(p) without rcu lock and p is alive.
> > The code is correct, this thread can do nothing until we drop ->siglock.
>
> The problem is that we have to tell lockdep this.  Just checking in
> __task_cred() that siglock is held is insufficient.  That doesn't handle, say,
> sys_setuid() from changing the credentials, and effectively skips the check in
> places where it mustn't.
>
> Similarly, having interrupts disabled on the CPU we're running on doesn't help
> either, since it doesn't stop another CPU replacing those credentials.
>
> There are ways of dealing with wait_task_stopped():
>
>  (1) Place an rcu_read_lock()'d section around the call to __task_cred().

Sure, this solves the problem. But probably this needs a comment to
explain why do we take rcu lock.

OTOH, wait_task_continued() does need rcu_read_lock(), the task is running.

UNLESS we believe that local_irq_disable() makes rcu_read_lock() unnecessary,
see below.

>  (2) Make __task_cred()'s lockdep understand about the target task being
>      stopped whilst we hold its siglock.

May be... but we have so many special cases. Say, fill_psinfo()->__task_cred().
This is called under rcu lock, but it is not needed. The task is either
current or it sleeps in exit_mm().

I mean, perhaps it is better to either always require rcu_read_lock()
around __task_cred() even if it is not needed, or do not use
rcu_dereference_check() at all.

In any case, task_is_dead() doesn't help afaics, it is only useful for
wait_task_zombie().

> > I must admit, at first glance changing check_kill_permission() to take
> > rcu lock looks better to me.
>
> I think group_send_sig_info() would be better.  The only other caller of
> c_k_p() already has to hold the RCU read lock for other reasons.
>
> How about the attached patch then?

Agreed, the patch looks fine to me.

> > > > On the other hand, some of the callers are either holding the RCU read
> > > > lock already, or have disabled interrupts,
> >
> > Hmm. So, local_irq_disable() "officially" blocks rcu? It does in practice
> > (unless I missed the new version of RCU), but, say,  posix_timer_event()
> > takes rcu_read_lock() exactly because I thought we shouldn't assume that
> > irqs_disabled() acts as rcu_read_lock() ?
>
> This CPU can't be preempted if it can't be interrupted, I think.

Yes, please note "It does in practice" above.

My question is, should/can we rely on this fact? Or should we assume
that nothing except rcu_read_lock() implies rcu_read_lock() ?

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment
  2010-08-04 14:01         ` David Howells
  2010-08-04 15:08           ` Oleg Nesterov
@ 2010-08-04 15:22           ` David Howells
  2010-08-04 15:44             ` Oleg Nesterov
  1 sibling, 1 reply; 19+ messages in thread
From: David Howells @ 2010-08-04 15:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: dhowells, Linus Torvalds, Thomas Gleixner, Tetsuo Handa, paulmck,
	akpm, linux-kernel, linux-security-module, Jiri Olsa,
	Roland McGrath

Oleg Nesterov <oleg@redhat.com> wrote:

> > > I must admit, at first glance changing check_kill_permission() to take
> > > rcu lock looks better to me.
> >
> > I think group_send_sig_info() would be better.  The only other caller of
> > c_k_p() already has to hold the RCU read lock for other reasons.
> >
> > How about the attached patch then?
> 
> Agreed, the patch looks fine to me.

Can I take that as an Acked-by?

David

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment
  2010-08-04 15:22           ` David Howells
@ 2010-08-04 15:44             ` Oleg Nesterov
  0 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2010-08-04 15:44 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Thomas Gleixner, Tetsuo Handa, paulmck, akpm,
	linux-kernel, linux-security-module, Jiri Olsa, Roland McGrath

On 08/04, David Howells wrote:
>
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > > > I must admit, at first glance changing check_kill_permission() to take
> > > > rcu lock looks better to me.
> > >
> > > I think group_send_sig_info() would be better.  The only other caller of
> > > c_k_p() already has to hold the RCU read lock for other reasons.
> > >
> > > How about the attached patch then?
> >
> > Agreed, the patch looks fine to me.
>
> Can I take that as an Acked-by?

Yes, thanks, please feel free to add

Acked-by: Oleg Nesterov <oleg@redhat.com>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner  comment
  2010-08-03 16:07       ` Linus Torvalds
                           ` (2 preceding siblings ...)
  2010-08-04 14:01         ` David Howells
@ 2010-08-05  7:19         ` Eric W. Biederman
  2010-08-05 16:14           ` Linus Torvalds
  3 siblings, 1 reply; 19+ messages in thread
From: Eric W. Biederman @ 2010-08-05  7:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Oleg Nesterov, Thomas Gleixner, Tetsuo Handa,
	paulmck, akpm, linux-kernel, linux-security-module, Jiri Olsa

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Added Oleg and Thomas to the participants.
>
> Oleg/Thomas: the whole thread is on lkml, but I'm quoting most of the
> relevant parts..
>
> On Tue, Aug 3, 2010 at 2:34 AM, David Howells <dhowells@redhat.com> wrote:
>>
>> A previous patch:
>>
>>        commit 8f92054e7ca1d3a3ae50fb42d2253ac8730d9b2a
>>        Author: David Howells <dhowells@redhat.com>
>>        Date:   Thu Jul 29 12:45:55 2010 +0100
>>        Subject: CRED: Fix __task_cred()'s lockdep check and banner comment
>>
>> fixed the lockdep checks on __task_cred().  This has shown up a place in the
>> signalling code where a lock should be held - namely that
>> check_kill_permission() requires its callers to hold the RCU lock.
>
> It's not just check_kill_permission(), is it? I thought we could do
> the "for_each_process()" loops with just RCU, rather than holding the
> whole tasklist_lock? So I _think_ that getting the RCU read-lock would
> make it possible to get rid of the tasklist_lock in there too? At
> least in kill_something_info().
>
> Yes/no? What am I missing? This is an Oleg question, mainly.

No.  When we send a signal to multiple processes it needs to be an
atomic operation so that kill -KILL -pgrp won't let processes escape.
It is what posix specifies, it is what real programs expect, and it
is the useful semantic in userspace.

Just using rcu_read_lock() breaks that atomicity of sending a signal
to a process group, which means we can have new processes that escape
the signal.  Even with the tasklist_lock we have to have a special
case in fork to ensure we don't add a process to a process group
while a signal is being delivered to it.

I have scratched my head a few times wondering if there is a way to
preserve the atomic guarantee without taking a global lock, but I
haven't found one I can wrap my head around yet.

>>        success = 0;
>>        retval = -ESRCH;
>> +       rcu_read_lock();
>>        do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
>>                int err = group_send_sig_info(sig, info, p);
>>                success |= !err;
>>                retval = err;
>>        } while_each_pid_task(pgrp, PIDTYPE_PGID, p);
>> +       rcu_read_unlock();
>
> Ok, makes sense. At the same time, I do wonder if we shouldn't perhaps
> do this in the caller. The callers would seem to all want it - look at
> kill_something_info(), for example, where it really looks like it
> would be nicer to put that _whole_ function under rcu_read_lock() (in
> the caller itself). Or in kernel/exit.c, we have two calls
> back-to-back, so doing it in the caller would clean that up and do it
> just once.

Likely.  At one point read_lock(&tasklist_lock) implied having the
rcu_read_lock().  As it no longer does in some cases we have a small
pile of places with outdated assumptions.  I know I have been guilty a
few times of forgetting about the new rule that we have to take rcu_read_lock()
everywhere.

> Look here:
>
>> @@ -1261,6 +1263,7 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
>>                int retval = 0, count = 0;
>>                struct task_struct * p;
>>
>> +               rcu_read_lock();
>>                for_each_process(p) {
>>                        if (task_pid_vnr(p) > 1 &&
>>                                        !same_thread_group(p, current)) {
>> @@ -1270,6 +1273,7 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
>>                                        retval = err;
>>                        }
>>                }
>> +               rcu_read_unlock();
>>                ret = count ? retval : -ESRCH;
>>        }
>>        read_unlock(&tasklist_lock);
>
> with those fragments, we now end up having rcu_read_lock() for _all_
> the three cases in kill_something_info(), but rather than do it once,
> we do it explicitly, and we do it in two different ways (two cases do
> it explicitly, the middle one does it implicitly when it calls
> __kill_pgrp_info().
>
> So I think the patch is correct, but at the same time I do get the
> feeling that we should just do it slightly differently.
>
> Comments?

With the tasklist_lock the rule in these functions is that the caller
will take the lock, so we probably make the rule the caller should
take the lock in the same scenarios for the rcu_read_lock(). Aka just
say:

read_lock(&tasklist_lock);
rcu_read_lock();

everywhere, that today we say just:

read_lock(&tasklist_lock);

It is what was meant when the code was written and the rcu-ized, and
it will certainly preserve my human intuition and general practical
reality that when you have the tasklist_lock you have the
rcu_read_lock.

*Shrug*  I don't think it matters a whole lot.

Eric

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner  comment
  2010-08-05  7:19         ` Eric W. Biederman
@ 2010-08-05 16:14           ` Linus Torvalds
  2010-08-05 18:16             ` Oleg Nesterov
  2010-08-05 20:13             ` Eric W. Biederman
  0 siblings, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2010-08-05 16:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Howells, Oleg Nesterov, Thomas Gleixner, Tetsuo Handa,
	paulmck, akpm, linux-kernel, linux-security-module, Jiri Olsa

On Thu, Aug 5, 2010 at 12:19 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> No.  When we send a signal to multiple processes it needs to be an
> atomic operation so that kill -KILL -pgrp won't let processes escape.
> It is what posix specifies, it is what real programs expect, and it
> is the useful semantic in userspace.

Ok. However, in that case, it's not really about the whole list
traversal, it's a totally separate thing, and it's really sad that we
end up using the (rather hot) tasklist_lock for something like that.
With the dcache/inode locks basically going away, I think
tasklist_lock ends up being one of the few hot locks left.

Wouldn't it be much nicer to:
 - make it clear that all the "real" signal locking can rely on RCU
 - use a separate per-pgrp lock that ends up being the one that gives
the signal _semantic_ meaning?

That would automatically document why we get the lock too, which
certainly isn't clear from the code as-is.

The per-pgrp lock might be something as simple as a silly hash that
just spreads out the process groups over some random number of simple
spinlocks.

> With the tasklist_lock the rule in these functions is that the caller
> will take the lock, so we probably make the rule the caller should
> take the lock in the same scenarios for the rcu_read_lock(). Aka just
> say:
>
> read_lock(&tasklist_lock);
> rcu_read_lock();
>
> everywhere, that today we say just:
>
> read_lock(&tasklist_lock);

I agree that we probably should have done that originally, in order to
not have these bugs show up later. However, I don't think it makes
sense any more, especially not if tasklist_lock isn't even a "real"
lock from a kernel internal consistency standpoint, but has a totally
secondary meaning.

                       Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment
  2010-08-05 16:14           ` Linus Torvalds
@ 2010-08-05 18:16             ` Oleg Nesterov
  2010-08-05 20:13             ` Eric W. Biederman
  1 sibling, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2010-08-05 18:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, David Howells, Thomas Gleixner, Tetsuo Handa,
	paulmck, akpm, linux-kernel, linux-security-module, Jiri Olsa

On 08/05, Linus Torvalds wrote:
>
> On Thu, Aug 5, 2010 at 12:19 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > No.  When we send a signal to multiple processes it needs to be an
> > atomic operation so that kill -KILL -pgrp won't let processes escape.
> > It is what posix specifies, it is what real programs expect, and it
> > is the useful semantic in userspace.
>
> Ok. However, in that case, it's not really about the whole list
> traversal, it's a totally separate thing, and it's really sad that we
> end up using the (rather hot) tasklist_lock for something like that.
> With the dcache/inode locks basically going away, I think
> tasklist_lock ends up being one of the few hot locks left.
>
> Wouldn't it be much nicer to:
>  - make it clear that all the "real" signal locking can rely on RCU
>  - use a separate per-pgrp lock that ends up being the one that gives
> the signal _semantic_ meaning?
>
> That would automatically document why we get the lock too, which
> certainly isn't clear from the code as-is.
>
> The per-pgrp lock might be something as simple as a silly hash that
> just spreads out the process groups over some random number of simple
> spinlocks.

I still think we can avoid the new lock and rely on RCU in
kill_something_info() and kill_pgrp().

I am attaching my old email below. It talks about pgrp, however I
think kill_something_info() is almost the same thing.

Oleg.

On 12/07, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > On 12/05, Eric W. Biederman wrote:
> >>
> >> Atomically sending signal to every member of a process group, is the
> >> big fly in the ointment I am aware of.  Last time I looked I could
> >> not see how to convert it rcu.
> >
> > I am not sure, but iirc we can do this lockless (under rcu_lock).
> > We need to modify pid_link to use list_entry and attach_pid() should
> > add the new task to the end. Of course we need more changes, but
> > (again iirc) this is not too hard.
>
> The problem is that even adding to the end of the list, we could run
> into a deleted entry and not see the new end of the list.
>
> Suppose when we start iterating the list we have:
>
>   A -> B -> C -> D
>
> Then someone deletes some of the entries while we are iterating the list.
>
> A ->
>  B' -> C' -> D'
>
> We will continue on traversing through the deleted entries.
>
> Then someone adds a new entry to the end of the list.
>
> A-> N
>
> Since we are at B', C' or D' we will never see the new entry on the
> end of the list.

Yes, but who can add the new entry?

Let's forget about setpgrp/etc for the moment, I think we have "races"
with or without tasklist. Say, setpgrp() can add the new process to the
already "killed" pgrp.

Then, I think the only important case is SIGKILL/SIGSTOP (or other
signals which can't be blockes/ignored). We must kill/stop the entire
pgrp, we must not race with fork() and miss a child.

In this case I _think_ rcu_read_lock() is enough,

	rcu_read_lock()

	list_for_each_entry_rcu(task, pid->tasks[PIDTYPE_PGID)
		group_send_sig_info(sig, task);

	rcu_read_unlock();

except group_send_sig_info() can race with mt-exec, but this is simple
to fix.

If we send a signal (not necessary SIGKILL) to a process P, we must see
all childs which were forked by P, both send_signal() and copy_process()
take the same ->siglock, we must see the result of list_add_tail_rcu().
And, after we sent SIGKILL/SIGSTOP, it can't fork the new child.

If list_for_each_entry() does not see the exited process P, this means
we see the result of list_del_rcu(). But this also means we must the
the result of the previous list_add_rcu().

IOW, fork+exit means list_add_rcu() + wmb() + list_del_rcu(), if we
don't see the new entry on list, we must see the new one, right?

(I am ignoring the case when list_for_each_entry_rcu() sees a process
 P but lock_task_sighand(P) fails, I think this is the same as if we
 we missed P)

Now suppose a signal is blocked/ignored or has a handler. In this case
we can miss a child, but I think this is OK, we can pretend the new
child was forked after kill_pgrp() completes. Say, this child C was
forked by some process P. We can miss C only if it was forked after
we already sent the signal to P.

However. I do not pretend the reasoning above is "complete", and
perhaps I missed something else.

> Additionally we have the other possibility that if a child is forking
> we send the signal to the parent after the child forks away but before
> the child joins whichever list we are walking, and we complete our
> traversal without seeing the child.

Not sure I understand... But afaics this case is covered above.
->siglock should serialize this, copy_process() does attach_pid()
under this lock.

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner  comment
  2010-08-05 16:14           ` Linus Torvalds
  2010-08-05 18:16             ` Oleg Nesterov
@ 2010-08-05 20:13             ` Eric W. Biederman
  2010-08-05 20:26               ` Linus Torvalds
  1 sibling, 1 reply; 19+ messages in thread
From: Eric W. Biederman @ 2010-08-05 20:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Oleg Nesterov, Thomas Gleixner, Tetsuo Handa,
	paulmck, akpm, linux-kernel, linux-security-module, Jiri Olsa

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Aug 5, 2010 at 12:19 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> No.  When we send a signal to multiple processes it needs to be an
>> atomic operation so that kill -KILL -pgrp won't let processes escape.
>> It is what posix specifies, it is what real programs expect, and it
>> is the useful semantic in userspace.
>
> Ok. However, in that case, it's not really about the whole list
> traversal, it's a totally separate thing, and it's really sad that we
> end up using the (rather hot) tasklist_lock for something like that.
> With the dcache/inode locks basically going away, I think
> tasklist_lock ends up being one of the few hot locks left.

It is about the list traversal.  In the process group case it is about
traversing the pid->tasks[PIDTYPE_PGID] hlist, which is also protected
by the tasklist_lock.

> Wouldn't it be much nicer to:
>  - make it clear that all the "real" signal locking can rely on RCU
>  - use a separate per-pgrp lock that ends up being the one that gives
> the signal _semantic_ meaning?
>
> That would automatically document why we get the lock too, which
> certainly isn't clear from the code as-is.
>
> The per-pgrp lock might be something as simple as a silly hash that
> just spreads out the process groups over some random number of simple
> spinlocks.

I think it is totally reasonable to add a per pid lock,
that would protect the pid->task[...] hlist.  That would make
things clearer and finer grained without a lot of effort.  Just
a little more struct pid bloat, and a little extra care in fork,
when we add to those lists.

Even with the per-pgrp lock we still need a lock on the global process
list for the kill -KILL -1 case.  Which suggests that tasklist_lock is
still needed for part of kill_something_info.

Eric

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner  comment
  2010-08-05 20:13             ` Eric W. Biederman
@ 2010-08-05 20:26               ` Linus Torvalds
  2010-08-05 21:20                 ` Eric W. Biederman
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2010-08-05 20:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Howells, Oleg Nesterov, Thomas Gleixner, Tetsuo Handa,
	paulmck, akpm, linux-kernel, linux-security-module, Jiri Olsa

On Thu, Aug 5, 2010 at 1:13 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I think it is totally reasonable to add a per pid lock,
> that would protect the pid->task[...] hlist.  That would make
> things clearer and finer grained without a lot of effort.  Just
> a little more struct pid bloat, and a little extra care in fork,
> when we add to those lists.

Hmm. Have you taken a look at Nick Piggin's VFS scalability patches?
They introduce this "RCU-safe hash chain lock", where each hashchain
has a lock-bit in the low bit. I wonder if that would be the right
thing to use?

> Even with the per-pgrp lock we still need a lock on the global process
> list for the kill -KILL -1 case.  Which suggests that tasklist_lock is
> still needed for part of kill_something_info.

Well, that -1 case is special anyway. The fact that we might want to
use the tasklist_lock there is not very relevant, I think. That is
_not_ a hotpath, really (at least not under any relevant loads, I'm
sure you could make a silly benchmark of "kill(-1,0)").

                          Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner  comment
  2010-08-05 20:26               ` Linus Torvalds
@ 2010-08-05 21:20                 ` Eric W. Biederman
  0 siblings, 0 replies; 19+ messages in thread
From: Eric W. Biederman @ 2010-08-05 21:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Oleg Nesterov, Thomas Gleixner, Tetsuo Handa,
	paulmck, akpm, linux-kernel, linux-security-module, Jiri Olsa

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Aug 5, 2010 at 1:13 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> I think it is totally reasonable to add a per pid lock,
>> that would protect the pid->task[...] hlist.  That would make
>> things clearer and finer grained without a lot of effort.  Just
>> a little more struct pid bloat, and a little extra care in fork,
>> when we add to those lists.
>
> Hmm. Have you taken a look at Nick Piggin's VFS scalability patches?
> They introduce this "RCU-safe hash chain lock", where each hashchain
> has a lock-bit in the low bit. I wonder if that would be the right
> thing to use?

Interesting.  I haven't looked but a lock bit in the low bit of the
hlist head in struct pid would not add any space, and would not add
any extra cache line bounces.  So that would just be a matter of
adding the extra locks and getting the lock ordering correct.

>> Even with the per-pgrp lock we still need a lock on the global process
>> list for the kill -KILL -1 case.  Which suggests that tasklist_lock is
>> still needed for part of kill_something_info.
>
> Well, that -1 case is special anyway. The fact that we might want to
> use the tasklist_lock there is not very relevant, I think. That is
> _not_ a hotpath, really (at least not under any relevant loads, I'm
> sure you could make a silly benchmark of "kill(-1,0)").

I expect even signal deliver to process groups is relatively rare.

The interesting question is can we kill the tasklist_lock and/or the
tasklist.   A quick grep shows that we have maybe 100 users of the tasklist
in the entire kernel.  Poking a little deeper it looks like man of those
are connected to scheduling and are uses that today take the tasklist_lock
but would be fine with the rcu_read_lock().

Eric

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2010-08-05 21:20 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-29 11:45 [PATCH 1/2] CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials David Howells
2010-07-29 11:45 ` [PATCH 2/2] CRED: Fix __task_cred()'s lockdep check and banner comment David Howells
2010-08-02 20:40   ` Paul E. McKenney
2010-08-03  0:55     ` Tetsuo Handa
2010-08-03  9:34     ` David Howells
2010-08-03 16:07       ` Linus Torvalds
2010-08-03 17:48         ` Thomas Gleixner
2010-08-04 13:17         ` Oleg Nesterov
2010-08-04 14:01         ` David Howells
2010-08-04 15:08           ` Oleg Nesterov
2010-08-04 15:22           ` David Howells
2010-08-04 15:44             ` Oleg Nesterov
2010-08-05  7:19         ` Eric W. Biederman
2010-08-05 16:14           ` Linus Torvalds
2010-08-05 18:16             ` Oleg Nesterov
2010-08-05 20:13             ` Eric W. Biederman
2010-08-05 20:26               ` Linus Torvalds
2010-08-05 21:20                 ` Eric W. Biederman
2010-08-04  0:38       ` Tetsuo Handa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).