linux-kernel.vger.kernel.org archive mirror
* Re: [rfc][patch] Avoid taking global tasklist_lock for single threaded  process at getrusage()
@ 2005-12-24 17:52 Oleg Nesterov
  2005-12-27 20:21 ` Christoph Lameter
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2005-12-24 17:52 UTC (permalink / raw)
  To: Ravikiran Thirumalai, Shai Fultheim, Nippun Goel
  Cc: linux-kernel, Christoph Lameter, Andrew Morton

Ravikiran G Thirumalai wrote:
>
> +int getrusage_both(struct task_struct *p, struct rusage __user *ru)
>  {
> +	unsigned long flags;
> +	int lockflag = 0;
> +	cputime_t utime, stime;
>  	struct rusage r;
> -	read_lock(&tasklist_lock);
> -	k_getrusage(p, who, &r);
> -	read_unlock(&tasklist_lock);
> +	struct task_struct *t;
> +	memset((char *) &r, 0, sizeof (r));
> +
> +	if (unlikely(!p->signal))
> +		 return copy_to_user(ru, &r, sizeof(r)) ? -EFAULT : 0;
> +
> +	if (!thread_group_empty(p)) {
> +		read_lock(&tasklist_lock);
> +		lockflag = 1;
> +	}

I can't understand this. 'p' can do clone(CLONE_THREAD) immediately
after 'if (!thread_group_empty(p))' check.

> +	spin_lock_irqsave(&p->sighand->siglock, flags);

It is unsafe to do this (unless p == current or tasklist_lock is held) even
if 'p' is the only process in the thread group.

p->sighand can be changed (and even freed) if 'p' does exec, see
de_thread().

p->sighand may be NULL; nothing prevents 'p' from going through
release_task(p). This patch checks p->signal, but that check is
meaningless unless it is done under tasklist_lock.

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single threaded process at getrusage()
  2005-12-24 17:52 [rfc][patch] Avoid taking global tasklist_lock for single threaded process at getrusage() Oleg Nesterov
@ 2005-12-27 20:21 ` Christoph Lameter
  2005-12-28 12:38   ` [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess " Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2005-12-27 20:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ravikiran Thirumalai, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

On Sat, 24 Dec 2005, Oleg Nesterov wrote:

> I can't understand this. 'p' can do clone(CLONE_THREAD) immediately
> after 'if (!thread_group_empty(p))' check.

Only if p != current. As discussed later, the lockless approach is
intended to be used only if p == current.


* Re: [rfc][patch] Avoid taking global tasklist_lock for single  threadedprocess at getrusage()
  2005-12-27 20:21 ` Christoph Lameter
@ 2005-12-28 12:38   ` Oleg Nesterov
  2005-12-28 18:33     ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2005-12-28 12:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ravikiran Thirumalai, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

Christoph Lameter wrote:
> 
> On Sat, 24 Dec 2005, Oleg Nesterov wrote:
> 
> > I can't understand this. 'p' can do clone(CLONE_THREAD) immediately
> > after 'if (!thread_group_empty(p))' check.
> 
> Only if p != current. As discussed later the lockless approach is
> intended to only be used if p == current.

Unless I missed something, this function (getrusage_both) is called
from wait_noreap_copyout:

> +++ linux-2.6.15-rc6/kernel/exit.c	2005-12-23 10:42:16.000000000 -0800
> @@ -38,7 +38,7 @@
>  extern void sem_exit (void);
>  extern struct task_struct *child_reaper;
>  
> -int getrusage(struct task_struct *, int, struct rusage __user *);
> +int getrusage_both(struct task_struct *, struct rusage __user *);
>  
>  static void exit_mm(struct task_struct * tsk);
>  
> @@ -994,7 +994,7 @@
>  			       struct siginfo __user *infop,
>  			       struct rusage __user *rusagep)
>  {
> -	int retval = rusagep ? getrusage(p, RUSAGE_BOTH, rusagep) : 0;
> +	int retval = rusagep ? getrusage_both(p, rusagep) : 0;
>  	put_task_struct(p);

(the diff was generated without '-p', so the hunk headers lack function names)

So p != current.

Oleg.


* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2005-12-28 12:38   ` [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess " Oleg Nesterov
@ 2005-12-28 18:33     ` Ravikiran G Thirumalai
  2005-12-28 22:57       ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 25+ messages in thread
From: Ravikiran G Thirumalai @ 2005-12-28 18:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

On Wed, Dec 28, 2005 at 03:38:39PM +0300, Oleg Nesterov wrote:
> Christoph Lameter wrote:
> > 
> > On Sat, 24 Dec 2005, Oleg Nesterov wrote:
> > 
> > > I can't understand this. 'p' can do clone(CLONE_THREAD) immediately
> > > after 'if (!thread_group_empty(p))' check.
> > 
> > Only if p != current. As discussed later the lockless approach is
> > intended to only be used if p == current.
> 
> Unless I missed something this function (getrusage_both) is called
> from wait_noreap_copyout,

Hi Oleg,
Yes, this patch needs to be reworked.  I am on it.  I'd also missed that
p->signal was protected by the tasklist_lock.  Thanks for pointing it out.
I will put out the modified version soon.

Thanks,
Kiran


* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2005-12-28 18:33     ` Ravikiran G Thirumalai
@ 2005-12-28 22:57       ` Ravikiran G Thirumalai
  2005-12-30 17:57         ` Oleg Nesterov
  2006-01-03 18:18         ` Christoph Lameter
  0 siblings, 2 replies; 25+ messages in thread
From: Ravikiran G Thirumalai @ 2005-12-28 22:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

On Wed, Dec 28, 2005 at 10:33:45AM -0800, Ravikiran G Thirumalai wrote:
> On Wed, Dec 28, 2005 at 03:38:39PM +0300, Oleg Nesterov wrote:
> > Christoph Lameter wrote:
> > > 
> > > On Sat, 24 Dec 2005, Oleg Nesterov wrote:
> > > 
> > > > I can't understand this. 'p' can do clone(CLONE_THREAD) immediately
> > > > after 'if (!thread_group_empty(p))' check.
> > > 
> > > Only if p != current. As discussed later the lockless approach is
> > > intended to only be used if p == current.
> > 
> > Unless I missed something this function (getrusage_both) is called
> > from wait_noreap_copyout,
> 
> Hi Oleg,
> Yes, this patch needs to be reworked.  I am on it.  I'd also missed that
> p->signal was protected by the tasklist_lock.  Thanks for pointing it out.
> I will put out the modified version soon.

Oleg, here's the reworked patch.
Comments?

It would have been nice if we could avoid tasklist locking in the
getrusage_both case.  Maybe we should explore using RCU here along with
other patches from Paul McKenney....

Thanks,
Kiran

Following patch avoids taking the global tasklist_lock when possible,
if a process is single threaded during getrusage().  Any avoidance of
tasklist_lock is good for NUMA boxes (and possibly for large SMPs).

Signed-off-by: Nippun Goel <nippung@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: arch/mips/kernel/irixsig.c
===================================================================
--- arch/mips/kernel/irixsig.c.orig	2005-12-27 14:49:57.000000000 -0800
+++ arch/mips/kernel/irixsig.c	2005-12-27 14:52:47.000000000 -0800
@@ -540,7 +540,7 @@ out:
 #define IRIX_P_PGID   2
 #define IRIX_P_ALL    7
 
-extern int getrusage(struct task_struct *, int, struct rusage __user *);
+extern int getrusage_both(struct task_struct *, struct rusage __user *);
 
 #define W_EXITED     1
 #define W_TRAPPED    2
@@ -605,7 +605,7 @@ repeat:
 			remove_parent(p);
 			add_parent(p, p->parent);
 			write_unlock_irq(&tasklist_lock);
-			retval = ru ? getrusage(p, RUSAGE_BOTH, ru) : 0;
+			retval = ru ? getrusage_both(p, ru) : 0;
 			if (retval)
 				goto end_waitsys;
 
@@ -626,7 +626,7 @@ repeat:
 			current->signal->cutime += p->utime + p->signal->cutime;
 			current->signal->cstime += p->stime + p->signal->cstime;
 			if (ru != NULL)
-				getrusage(p, RUSAGE_BOTH, ru);
+				getrusage_both(p, ru);
 			retval = __put_user(SIGCHLD, &info->sig);
 			retval |= __put_user(1, &info->code);      /* CLD_EXITED */
 			retval |= __put_user(p->pid, &info->stuff.procinfo.pid);
Index: arch/mips/kernel/sysirix.c
===================================================================
--- arch/mips/kernel/sysirix.c.orig	2005-12-27 14:49:57.000000000 -0800
+++ arch/mips/kernel/sysirix.c	2005-12-27 14:52:47.000000000 -0800
@@ -234,7 +234,8 @@ asmlinkage int irix_prctl(unsigned optio
 #undef DEBUG_PROCGRPS
 
 extern unsigned long irix_mapelf(int fd, struct elf_phdr __user *user_phdrp, int cnt);
-extern int getrusage(struct task_struct *p, int who, struct rusage __user *ru);
+extern int getrusage_self(struct rusage __user *ru);
+extern int getrusage_children(struct rusage __user *ru);
 extern char *prom_getenv(char *name);
 extern long prom_setenv(char *name, char *value);
 
@@ -405,12 +406,12 @@ asmlinkage int irix_syssgi(struct pt_reg
 		switch((int) regs->regs[base + 5]) {
 		case 0:
 			/* rusage self */
-			retval = getrusage(current, RUSAGE_SELF, ru);
+			retval = getrusage_self(ru);
 			goto out;
 
 		case -1:
 			/* rusage children */
-			retval = getrusage(current, RUSAGE_CHILDREN, ru);
+			retval = getrusage_children(ru);
 			goto out;
 
 		default:
Index: kernel/exit.c
===================================================================
--- kernel/exit.c.orig	2005-12-27 14:49:57.000000000 -0800
+++ kernel/exit.c	2005-12-27 14:52:47.000000000 -0800
@@ -38,7 +38,7 @@
 extern void sem_exit (void);
 extern struct task_struct *child_reaper;
 
-int getrusage(struct task_struct *, int, struct rusage __user *);
+int getrusage_both(struct task_struct *, struct rusage __user *);
 
 static void exit_mm(struct task_struct * tsk);
 
@@ -994,7 +994,7 @@ static int wait_noreap_copyout(task_t *p
 			       struct siginfo __user *infop,
 			       struct rusage __user *rusagep)
 {
-	int retval = rusagep ? getrusage(p, RUSAGE_BOTH, rusagep) : 0;
+	int retval = rusagep ? getrusage_both(p, rusagep) : 0;
 	put_task_struct(p);
 	if (!retval)
 		retval = put_user(SIGCHLD, &infop->si_signo);
@@ -1111,7 +1111,7 @@ static int wait_task_zombie(task_t *p, i
 	 */
 	read_unlock(&tasklist_lock);
 
-	retval = ru ? getrusage(p, RUSAGE_BOTH, ru) : 0;
+	retval = ru ? getrusage_both(p, ru) : 0;
 	status = (p->signal->flags & SIGNAL_GROUP_EXIT)
 		? p->signal->group_exit_code : p->exit_code;
 	if (!retval && stat_addr)
@@ -1260,7 +1260,7 @@ bail_ref:
 
 	write_unlock_irq(&tasklist_lock);
 
-	retval = ru ? getrusage(p, RUSAGE_BOTH, ru) : 0;
+	retval = ru ? getrusage_both(p, ru) : 0;
 	if (!retval && stat_addr)
 		retval = put_user((exit_code << 8) | 0x7f, stat_addr);
 	if (!retval && infop)
@@ -1321,7 +1321,7 @@ static int wait_task_continued(task_t *p
 	read_unlock(&tasklist_lock);
 
 	if (!infop) {
-		retval = ru ? getrusage(p, RUSAGE_BOTH, ru) : 0;
+		retval = ru ? getrusage_both(p, ru) : 0;
 		put_task_struct(p);
 		if (!retval && stat_addr)
 			retval = put_user(0xffff, stat_addr);
Index: kernel/sys.c
===================================================================
--- kernel/sys.c.orig	2005-12-27 14:49:57.000000000 -0800
+++ kernel/sys.c	2005-12-28 14:14:05.000000000 -0800
@@ -1657,6 +1657,10 @@ asmlinkage long sys_setrlimit(unsigned i
 }
 
 /*
+ * getrusage routines:
+ * getrusage_self() and getrusage_children() always use current.
+ * getrusage_both() need not be for the current task.
+ *
  * It would make sense to put struct rusage in the task_struct,
  * except that would make the task_struct be *really big*.  After
  * task_struct gets moved into malloc'ed memory, it would
@@ -1664,94 +1668,190 @@ asmlinkage long sys_setrlimit(unsigned i
  * a lot simpler!  (Which we're not doing right now because we're not
  * measuring them yet).
  *
- * This expects to be called with tasklist_lock read-locked or better,
- * and the siglock not locked.  It may momentarily take the siglock.
- *
- * When sampling multiple threads for RUSAGE_SELF, under SMP we might have
- * races with threads incrementing their own counters.  But since word
- * reads are atomic, we either get new values or old values and we don't
- * care which for the sums.  We always take the siglock to protect reading
+ * In multi threaded scenario, we always take the siglock to protect reading
  * the c* fields from p->signal from races with exit.c updating those
  * fields when reaping, so a sample either gets all the additions of a
  * given child after it's reaped, or none so this sample is before reaping.
+ *
+ * getrusage_children -- locking:
+ * In multithreaded scenario, we need to take tasklist_lock for read
+ * to avoid races against another thread doing exec and changing/freeing
+ * signal_struct of the current task.  We need to take the sighand->siglock
+ * too as another thread could be reaping its children.  However, both locks
+ * can be avoided if we are a single threaded process, since we are current; 
+ * No one else can take our signal_struct away, and no one else can reap
+ * children to update our signal->c*  counters.
+ *
  */
-
-static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
+int getrusage_children(struct rusage __user *ru)
 {
-	struct task_struct *t;
 	unsigned long flags;
+	int lockflag = 0;
 	cputime_t utime, stime;
+	struct task_struct *p = current;
+	struct rusage r;
+	memset((char *) &r, 0, sizeof (r));
 
-	memset((char *) r, 0, sizeof *r);
+	if (!thread_group_empty(p)) {
+		read_lock(&tasklist_lock);
+		if (unlikely(!p->signal)) {
+			read_unlock(&tasklist_lock);
+			goto ret;
+		}
+		spin_lock_irqsave(&p->sighand->siglock, flags);
+		lockflag = 1;
+	}
+	
+	utime = p->signal->cutime;
+	stime = p->signal->cstime;
+	r.ru_nvcsw = p->signal->cnvcsw;
+	r.ru_nivcsw = p->signal->cnivcsw;
+	r.ru_minflt = p->signal->cmin_flt;
+	r.ru_majflt = p->signal->cmaj_flt;
+	if (lockflag) {
+		spin_unlock_irqrestore(&p->sighand->siglock, flags);
+		read_unlock(&tasklist_lock);
+	}
+	cputime_to_timeval(utime, &r.ru_utime);
+	cputime_to_timeval(stime, &r.ru_stime);
 
-	if (unlikely(!p->signal))
-		return;
+ret:
+	return copy_to_user(ru, &r, sizeof(r)) ? -EFAULT : 0;
+}
 
-	switch (who) {
-		case RUSAGE_CHILDREN:
-			spin_lock_irqsave(&p->sighand->siglock, flags);
-			utime = p->signal->cutime;
-			stime = p->signal->cstime;
-			r->ru_nvcsw = p->signal->cnvcsw;
-			r->ru_nivcsw = p->signal->cnivcsw;
-			r->ru_minflt = p->signal->cmin_flt;
-			r->ru_majflt = p->signal->cmaj_flt;
-			spin_unlock_irqrestore(&p->sighand->siglock, flags);
-			cputime_to_timeval(utime, &r->ru_utime);
-			cputime_to_timeval(stime, &r->ru_stime);
-			break;
-		case RUSAGE_SELF:
-			spin_lock_irqsave(&p->sighand->siglock, flags);
-			utime = stime = cputime_zero;
-			goto sum_group;
-		case RUSAGE_BOTH:
-			spin_lock_irqsave(&p->sighand->siglock, flags);
-			utime = p->signal->cutime;
-			stime = p->signal->cstime;
-			r->ru_nvcsw = p->signal->cnvcsw;
-			r->ru_nivcsw = p->signal->cnivcsw;
-			r->ru_minflt = p->signal->cmin_flt;
-			r->ru_majflt = p->signal->cmaj_flt;
-		sum_group:
-			utime = cputime_add(utime, p->signal->utime);
-			stime = cputime_add(stime, p->signal->stime);
-			r->ru_nvcsw += p->signal->nvcsw;
-			r->ru_nivcsw += p->signal->nivcsw;
-			r->ru_minflt += p->signal->min_flt;
-			r->ru_majflt += p->signal->maj_flt;
-			t = p;
-			do {
-				utime = cputime_add(utime, t->utime);
-				stime = cputime_add(stime, t->stime);
-				r->ru_nvcsw += t->nvcsw;
-				r->ru_nivcsw += t->nivcsw;
-				r->ru_minflt += t->min_flt;
-				r->ru_majflt += t->maj_flt;
-				t = next_thread(t);
-			} while (t != p);
-			spin_unlock_irqrestore(&p->sighand->siglock, flags);
-			cputime_to_timeval(utime, &r->ru_utime);
-			cputime_to_timeval(stime, &r->ru_stime);
-			break;
-		default:
-			BUG();
+/*
+ * getrusage_self:
+ * In multithreaded scenario, we need to take the tasklist_lock for read
+ * since we traverse the task TGID hash list.  However, we do not need to 
+ * take the siglock even for the multithreaded case,  as the signal fields, 
+ * which the siglock protects are only updated at __exit_signal  with
+ * tasklist_lock taken for write, which cannot happen as we have the read_lock
+ * on tasklist_lock.
+ * If we are single threaded, since we are current, we don't need to take
+ * the tasklist_lock or the siglock as no one else can race with the
+ * signal fields.
+ *
+ * When sampling multiple threads for getrusage_self(), under SMP we might have
+ * races with threads incrementing their own counters.  But since word
+ * reads are atomic, we either get new values or old values and we don't
+ * care which for the sums.
+ *
+ */
+int getrusage_self(struct rusage __user *ru)
+{
+	int lockflag = 0;
+	cputime_t utime, stime;
+	struct task_struct *t, *p = current;
+	struct rusage r;
+	memset((char *) &r, 0, sizeof (r));
+
+	if (!thread_group_empty(p)) {
+		read_lock(&tasklist_lock);
+		lockflag = 1;
 	}
+
+	if (unlikely(!p->signal)) {
+		if (lockflag)
+			read_unlock(&tasklist_lock);
+		goto ret;
+	}
+
+	utime = p->signal->utime;
+	stime = p->signal->stime;
+	r.ru_nvcsw = p->signal->nvcsw;
+	r.ru_nivcsw = p->signal->nivcsw;
+	r.ru_minflt = p->signal->min_flt;
+	r.ru_majflt = p->signal->maj_flt;
+	t = p;
+	do {
+		utime = cputime_add(utime, t->utime);
+		stime = cputime_add(stime, t->stime);
+		r.ru_nvcsw += t->nvcsw;
+		r.ru_nivcsw += t->nivcsw;
+		r.ru_minflt += t->min_flt;
+		r.ru_majflt += t->maj_flt;
+		t = next_thread(t);
+	} while (t != p);
+
+	if (lockflag)
+		read_unlock(&tasklist_lock);
+	cputime_to_timeval(utime, &r.ru_utime);
+	cputime_to_timeval(stime, &r.ru_stime);
+
+ret:
+	return copy_to_user(ru, &r, sizeof(r)) ? -EFAULT : 0;
 }
 
-int getrusage(struct task_struct *p, int who, struct rusage __user *ru)
+/*
+ * getrusage_both:
+ * getrusage_both can be invoked for non current processes -- usually from
+ * wait_* routines, on the children. As even a single threaded process 
+ * we are waiting on can spawn another thread and exec, we cannot 
+ * avoid tasklist_lock for both single threaded and multithreaded cases.
+ * We also need to take the siglock to protect the c* fields of the 
+ * signal_struct for both single threaded and multi threaded case.
+ *
+ */
+int getrusage_both(struct task_struct *p, struct rusage __user *ru)
 {
+	unsigned long flags;
+	cputime_t utime, stime;
 	struct rusage r;
+	struct task_struct *t;
+	memset((char *) &r, 0, sizeof (r));
+
 	read_lock(&tasklist_lock);
-	k_getrusage(p, who, &r);
+	if (unlikely(!p->signal)) {
+		read_unlock(&tasklist_lock);
+		goto ret;
+	}
+
+	spin_lock_irqsave(&p->sighand->siglock, flags);
+	utime = p->signal->cutime;
+	stime = p->signal->cstime;
+	r.ru_nvcsw = p->signal->cnvcsw;
+	r.ru_nivcsw = p->signal->cnivcsw;
+	r.ru_minflt = p->signal->cmin_flt;
+	r.ru_majflt = p->signal->cmaj_flt;
+	spin_unlock_irqrestore(&p->sighand->siglock, flags);
+
+	utime = cputime_add(utime, p->signal->utime);
+	stime = cputime_add(stime, p->signal->stime);
+	r.ru_nvcsw += p->signal->nvcsw;
+	r.ru_nivcsw += p->signal->nivcsw;
+	r.ru_minflt += p->signal->min_flt;
+	r.ru_majflt += p->signal->maj_flt;
+
+	t = p;
+	do {
+		utime = cputime_add(utime, t->utime);
+		stime = cputime_add(stime, t->stime);
+		r.ru_nvcsw += t->nvcsw;
+		r.ru_nivcsw += t->nivcsw;
+		r.ru_minflt += t->min_flt;
+		r.ru_majflt += t->maj_flt;
+		t = next_thread(t);
+	} while (t != p);
+
 	read_unlock(&tasklist_lock);
+	cputime_to_timeval(utime, &r.ru_utime);
+	cputime_to_timeval(stime, &r.ru_stime);
+
+ret:
 	return copy_to_user(ru, &r, sizeof(r)) ? -EFAULT : 0;
 }
 
 asmlinkage long sys_getrusage(int who, struct rusage __user *ru)
 {
-	if (who != RUSAGE_SELF && who != RUSAGE_CHILDREN)
-		return -EINVAL;
-	return getrusage(current, who, ru);
+	switch (who) {
+		case RUSAGE_SELF:
+			return getrusage_self(ru);
+		case RUSAGE_CHILDREN:
+			return getrusage_children(ru);
+		default:
+			break;
+	}
+	return -EINVAL;
 }
 
 asmlinkage long sys_umask(int mask)


* Re: [rfc][patch] Avoid taking global tasklist_lock for single  threadedprocess at getrusage()
  2005-12-28 22:57       ` Ravikiran G Thirumalai
@ 2005-12-30 17:57         ` Oleg Nesterov
  2006-01-04 23:16           ` Ravikiran G Thirumalai
  2006-01-03 18:18         ` Christoph Lameter
  1 sibling, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2005-12-30 17:57 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

> Ravikiran G Thirumalai wrote:
>
> Following patch avoids taking the global tasklist_lock when possible,
> if a process is single threaded during getrusage().  Any avoidance of
> tasklist_lock is good for NUMA boxes (and possibly for large SMPs).

> --- arch/mips/kernel/irixsig.c.orig	2005-12-27 14:49:57.000000000 -0800
> +++ arch/mips/kernel/irixsig.c	2005-12-27 14:52:47.000000000 -0800
> @@ -540,7 +540,7 @@ out:
>  #define IRIX_P_PGID   2
>  #define IRIX_P_ALL    7
>
> -extern int getrusage(struct task_struct *, int, struct rusage __user *);
> +extern int getrusage_both(struct task_struct *, struct rusage __user *);

I think it makes better sense to move this declaration to include/.

> -static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
> +int getrusage_children(struct rusage __user *ru)
>  {
> -	struct task_struct *t;
>  	unsigned long flags;
> +	int lockflag = 0;
>  	cputime_t utime, stime;
> +	struct task_struct *p = current;
> +	struct rusage r;
> +	memset((char *) &r, 0, sizeof (r));
>
> -	memset((char *) r, 0, sizeof *r);
> +	if (!thread_group_empty(p)) {
> +		read_lock(&tasklist_lock);
> +		if (unlikely(!p->signal)) {
> +			read_unlock(&tasklist_lock);
> +			goto ret;

Is this possible? 'current' always has a valid signal/sighand.
Or better: a process can't call getrusage() after exit_signal().

> +		}
> +		spin_lock_irqsave(&p->sighand->siglock, flags);
> +		lockflag = 1;
> +	}

What if another thread just exited? I think you need 'else smp_rmb()'
here. Otherwise the CPU can read signal->c* out of order.

> +int getrusage_self(struct rusage __user *ru)

Same comments.

> +int getrusage_both(struct task_struct *p, struct rusage __user *ru)
>  {
> +	unsigned long flags;
> +	cputime_t utime, stime;
>  	struct rusage r;
> +	struct task_struct *t;
> +	memset((char *) &r, 0, sizeof (r));
> +
>  	read_lock(&tasklist_lock);
> -	k_getrusage(p, who, &r);
> +	if (unlikely(!p->signal)) {
> +		read_unlock(&tasklist_lock);
> +		goto ret;
> +	}
> +
> +	spin_lock_irqsave(&p->sighand->siglock, flags);
> +	utime = p->signal->cutime;
> +	stime = p->signal->cstime;
> +	r.ru_nvcsw = p->signal->cnvcsw;
> +	r.ru_nivcsw = p->signal->cnivcsw;
> +	r.ru_minflt = p->signal->cmin_flt;
> +	r.ru_majflt = p->signal->cmaj_flt;
> +	spin_unlock_irqrestore(&p->sighand->siglock, flags);
> +
> +	utime = cputime_add(utime, p->signal->utime);
> +	stime = cputime_add(stime, p->signal->stime);
> +	r.ru_nvcsw += p->signal->nvcsw;
> +	r.ru_nivcsw += p->signal->nivcsw;
> +	r.ru_minflt += p->signal->min_flt;
> +	r.ru_majflt += p->signal->maj_flt;
> +
> +	t = p;
> +	do {
> +		utime = cputime_add(utime, t->utime);
> +		stime = cputime_add(stime, t->stime);
> +		r.ru_nvcsw += t->nvcsw;
> +		r.ru_nivcsw += t->nivcsw;
> +		r.ru_minflt += t->min_flt;
> +		r.ru_majflt += t->maj_flt;
> +		t = next_thread(t);
> +	} while (t != p);
> +
>  	read_unlock(&tasklist_lock);
> +	cputime_to_timeval(utime, &r.ru_utime);
> +	cputime_to_timeval(stime, &r.ru_stime);
> +
> +ret:
>  	return copy_to_user(ru, &r, sizeof(r)) ? -EFAULT : 0;
>  }

Looks like we can factor out some code.

Actually I don't understand why we can't move the locking into
k_getrusage:

k_getrusage()

	lockflag = (p == current && thread_group_empty(p));
	if (lockflag) {
		read_lock(&tasklist_lock);
		spin_lock_irqsave(&p->sighand->siglock, flags);
	}

	and remove ->sighand locking under 'switch' statement.

Isn't this enough to solve the performance problems?

Oleg.


* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2005-12-28 22:57       ` Ravikiran G Thirumalai
  2005-12-30 17:57         ` Oleg Nesterov
@ 2006-01-03 18:18         ` Christoph Lameter
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Lameter @ 2006-01-03 18:18 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Oleg Nesterov, Shai Fultheim, Nippun Goel, linux-kernel, Andrew Morton

I still think that it would be best to not handle __user pointers in
getrusage_*. Alternatively, let's give up on these smaller functions
if the modifications are simple, and modify the existing function as
Oleg suggested.


* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2005-12-30 17:57         ` Oleg Nesterov
@ 2006-01-04 23:16           ` Ravikiran G Thirumalai
  2006-01-05 19:17             ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Ravikiran G Thirumalai @ 2006-01-04 23:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

On Fri, Dec 30, 2005 at 08:57:41PM +0300, Oleg Nesterov wrote:
> > -	memset((char *) r, 0, sizeof *r);
> > +	if (!thread_group_empty(p)) {
> > +		read_lock(&tasklist_lock);
> > +		if (unlikely(!p->signal)) {
> > +			read_unlock(&tasklist_lock);
> > +			goto ret;
> 
> Is this possible? 'current' always has valid signal/sighand.
> Or better say, process can't call getrusage after exit_signal().

You are right, this check was unnecessary.

> 
> > +		}
> > +		spin_lock_irqsave(&p->sighand->siglock, flags);
> > +		lockflag = 1;
> > +	}
> 
> What if another thread just exited? I think you need 'else smp_rmb()'
> here. Otherwise the CPU can read signal->c* out of order.

Yes, looks like we do.  We probably need to do the same in sys_times() too...

> 
> >  }
> 
> Looks we can factor out some code.
> 
> Actually I don't understand why we can't move the locking into
> k_getrusage,
> 
> k_getrusage()
> 
> 	lock_flag = (p == current && thread_group_empty(p));
> 	if (lockflag) {
> 		read_lock(&tasklist_lock);
> 		spin_lock_irqsave(&p->sighand->siglock, flags);
> 	}
> 
> 	and remove ->sighand locking under 'switch' statement.
> 
> Isn't this enough to solve the performance problems?

Hmm, just that getrusage_self() takes the siglock in the multithreaded
case, which I think is not needed.  How's this patch?


getrusage_tasklistlock_optimization-v4

Following patch avoids taking the global tasklist_lock when possible,
if a process is single threaded during getrusage().  Any avoidance of
tasklist_lock is good for NUMA boxes (and possibly for large SMPs).


Index: linux-2.6.15-rc6/kernel/sys.c
===================================================================
--- linux-2.6.15-rc6.orig/kernel/sys.c	2006-01-03 12:12:05.000000000 -0800
+++ linux-2.6.15-rc6/kernel/sys.c	2006-01-04 13:41:17.000000000 -0800
@@ -1664,9 +1664,6 @@ asmlinkage long sys_setrlimit(unsigned i
  * a lot simpler!  (Which we're not doing right now because we're not
  * measuring them yet).
  *
- * This expects to be called with tasklist_lock read-locked or better,
- * and the siglock not locked.  It may momentarily take the siglock.
- *
  * When sampling multiple threads for RUSAGE_SELF, under SMP we might have
  * races with threads incrementing their own counters.  But since word
  * reads are atomic, we either get new values or old values and we don't
@@ -1674,6 +1671,25 @@ asmlinkage long sys_setrlimit(unsigned i
  * the c* fields from p->signal from races with exit.c updating those
  * fields when reaping, so a sample either gets all the additions of a
  * given child after it's reaped, or none so this sample is before reaping.
+ *
+ * tasklist_lock locking optimisation:
+ * If we are current and single threaded, we do not need to take the tasklist
+ * lock or the siglock.  No one else can take our signal_struct away, 
+ * no one else can reap the children to update signal->c* counters, and
+ * no one else can race with the signal-> fields.
+ * If we do not take the tasklist_lock, the signal-> fields could be read
+ * out of order while another thread was just exiting. So we place a 
+ * read memory barrier when we avoid the lock.  On the writer side, 
+ * write memory barrier is implied in  __exit_signal as __exit_signal releases 
+ * the siglock spinlock after updating the signal-> fields.
+ *
+ * We don't really need the siglock when we access the non c* fields
+ * of the signal_struct (for RUSAGE_SELF) even in multithreaded
+ * case, since we take the tasklist lock for read and the non c* signal-> 
+ * fields are updated only in __exit_signal, which is called with 
+ * tasklist_lock taken for write, hence these two threads cannot execute
+ * concurrently.
+ *
  */
 
 static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
@@ -1681,37 +1697,51 @@ static void k_getrusage(struct task_stru
 	struct task_struct *t;
 	unsigned long flags;
 	cputime_t utime, stime;
+	int need_lock = 0;
 
 	memset((char *) r, 0, sizeof *r);
 
-	if (unlikely(!p->signal))
-		return;
+	need_lock = (p == current && thread_group_empty(p));
+
+	if (need_lock) {
+		read_lock(&tasklist_lock);
+		if (unlikely(!p->signal)) {
+			read_unlock(&tasklist_lock);
+			return;
+		}
+	} else
+		/* See locking comments above */
+		smp_rmb();
 
 	switch (who) {
 		case RUSAGE_CHILDREN:
-			spin_lock_irqsave(&p->sighand->siglock, flags);
+			if (need_lock)
+				spin_lock_irqsave(&p->sighand->siglock, flags);
 			utime = p->signal->cutime;
 			stime = p->signal->cstime;
 			r->ru_nvcsw = p->signal->cnvcsw;
 			r->ru_nivcsw = p->signal->cnivcsw;
 			r->ru_minflt = p->signal->cmin_flt;
 			r->ru_majflt = p->signal->cmaj_flt;
-			spin_unlock_irqrestore(&p->sighand->siglock, flags);
+			if (need_lock)
+				spin_unlock_irqrestore(&p->sighand->siglock, flags);
 			cputime_to_timeval(utime, &r->ru_utime);
 			cputime_to_timeval(stime, &r->ru_stime);
 			break;
 		case RUSAGE_SELF:
-			spin_lock_irqsave(&p->sighand->siglock, flags);
 			utime = stime = cputime_zero;
 			goto sum_group;
 		case RUSAGE_BOTH:
-			spin_lock_irqsave(&p->sighand->siglock, flags);
+			if (need_lock)
+				spin_lock_irqsave(&p->sighand->siglock, flags);
 			utime = p->signal->cutime;
 			stime = p->signal->cstime;
 			r->ru_nvcsw = p->signal->cnvcsw;
 			r->ru_nivcsw = p->signal->cnivcsw;
 			r->ru_minflt = p->signal->cmin_flt;
 			r->ru_majflt = p->signal->cmaj_flt;
+			if (need_lock)
+				spin_unlock_irqrestore(&p->sighand->siglock, flags);
 		sum_group:
 			utime = cputime_add(utime, p->signal->utime);
 			stime = cputime_add(stime, p->signal->stime);
@@ -1729,21 +1759,21 @@ static void k_getrusage(struct task_stru
 				r->ru_majflt += t->maj_flt;
 				t = next_thread(t);
 			} while (t != p);
-			spin_unlock_irqrestore(&p->sighand->siglock, flags);
 			cputime_to_timeval(utime, &r->ru_utime);
 			cputime_to_timeval(stime, &r->ru_stime);
 			break;
 		default:
 			BUG();
 	}
+	
+	if (need_lock)
+		read_unlock(&tasklist_lock);
 }
 
 int getrusage(struct task_struct *p, int who, struct rusage __user *ru)
 {
 	struct rusage r;
-	read_lock(&tasklist_lock);
 	k_getrusage(p, who, &r);
-	read_unlock(&tasklist_lock);
 	return copy_to_user(ru, &r, sizeof(r)) ? -EFAULT : 0;
 }
 


* Re: [rfc][patch] Avoid taking global tasklist_lock for single  threadedprocess at getrusage()
  2006-01-04 23:16           ` Ravikiran G Thirumalai
@ 2006-01-05 19:17             ` Oleg Nesterov
  2006-01-06  9:46               ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2006-01-05 19:17 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

Ravikiran G Thirumalai wrote:
>
> +       need_lock = (p == current && thread_group_empty(p));

This should be

	need_lock = !(p == current && thread_group_empty(p));


I think we should cleanup k_getrusage() first, see the patch below.

Do you agree with that patch? If yes, could you make the new one on
top of it? (please send them both to Andrew).

Note that after this cleanup we don't have code duplication.


[PATCH] simplify k_getrusage()

Factor out common code for different RUSAGE_xxx cases.

Don't take ->sighand->siglock in RUSAGE_SELF case, suggested
by Ravikiran G Thirumalai.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>

--- K/kernel/sys.c~	2006-01-03 21:15:58.000000000 +0300
+++ K/kernel/sys.c	2006-01-06 00:40:34.000000000 +0300
@@ -1687,7 +1687,10 @@ static void k_getrusage(struct task_stru
 	if (unlikely(!p->signal))
 		return;
 
+	utime = stime = cputime_zero;
+
 	switch (who) {
+		case RUSAGE_BOTH:
 		case RUSAGE_CHILDREN:
 			spin_lock_irqsave(&p->sighand->siglock, flags);
 			utime = p->signal->cutime;
@@ -1697,22 +1700,11 @@ static void k_getrusage(struct task_stru
 			r->ru_minflt = p->signal->cmin_flt;
 			r->ru_majflt = p->signal->cmaj_flt;
 			spin_unlock_irqrestore(&p->sighand->siglock, flags);
-			cputime_to_timeval(utime, &r->ru_utime);
-			cputime_to_timeval(stime, &r->ru_stime);
-			break;
+
+			if (who == RUSAGE_CHILDREN)
+				break;
+
 		case RUSAGE_SELF:
-			spin_lock_irqsave(&p->sighand->siglock, flags);
-			utime = stime = cputime_zero;
-			goto sum_group;
-		case RUSAGE_BOTH:
-			spin_lock_irqsave(&p->sighand->siglock, flags);
-			utime = p->signal->cutime;
-			stime = p->signal->cstime;
-			r->ru_nvcsw = p->signal->cnvcsw;
-			r->ru_nivcsw = p->signal->cnivcsw;
-			r->ru_minflt = p->signal->cmin_flt;
-			r->ru_majflt = p->signal->cmaj_flt;
-		sum_group:
 			utime = cputime_add(utime, p->signal->utime);
 			stime = cputime_add(stime, p->signal->stime);
 			r->ru_nvcsw += p->signal->nvcsw;
@@ -1729,13 +1721,14 @@ static void k_getrusage(struct task_stru
 				r->ru_majflt += t->maj_flt;
 				t = next_thread(t);
 			} while (t != p);
-			spin_unlock_irqrestore(&p->sighand->siglock, flags);
-			cputime_to_timeval(utime, &r->ru_utime);
-			cputime_to_timeval(stime, &r->ru_stime);
 			break;
+
 		default:
 			BUG();
 	}
+
+	cputime_to_timeval(utime, &r->ru_utime);
+	cputime_to_timeval(stime, &r->ru_stime);
 }
 
 int getrusage(struct task_struct *p, int who, struct rusage __user *ru)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2006-01-05 19:17             ` Oleg Nesterov
@ 2006-01-06  9:46               ` Ravikiran G Thirumalai
  2006-01-06 17:23                 ` Christoph Lameter
  2006-01-08 11:49                 ` Oleg Nesterov
  0 siblings, 2 replies; 25+ messages in thread
From: Ravikiran G Thirumalai @ 2006-01-06  9:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

On Thu, Jan 05, 2006 at 10:17:01PM +0300, Oleg Nesterov wrote:
> Ravikiran G Thirumalai wrote:
> 
> 
> I think we should cleanup k_getrusage() first, see the patch below.
> 
> Do you agree with that patch? If yes, could you make the new one on
> top of it? (please send them both to Andrew).
> 
> Note that after this cleanup we don't have code duplication.

Yes, nice patch.  Thanks.  Here is the tasklist lock optimization on top of
the simplify-k_getrusage.patch.  Andrew, please apply.

Thanks,
Kiran


Following patch avoids taking the global tasklist_lock when possible,
if a process is single threaded during getrusage().  Any avoidance of
tasklist_lock is good for NUMA boxes (and possibly for large SMPs).
Thanks to Oleg Nesterov for review and suggestions.

Signed-off-by: Nippun Goel <nippung@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.15-delme/kernel/sys.c
===================================================================
--- linux-2.6.15-delme.orig/kernel/sys.c	2006-01-05 15:40:38.000000000 -0800
+++ linux-2.6.15-delme/kernel/sys.c	2006-01-06 00:43:08.000000000 -0800
@@ -1664,9 +1664,6 @@ asmlinkage long sys_setrlimit(unsigned i
  * a lot simpler!  (Which we're not doing right now because we're not
  * measuring them yet).
  *
- * This expects to be called with tasklist_lock read-locked or better,
- * and the siglock not locked.  It may momentarily take the siglock.
- *
  * When sampling multiple threads for RUSAGE_SELF, under SMP we might have
  * races with threads incrementing their own counters.  But since word
  * reads are atomic, we either get new values or old values and we don't
@@ -1674,6 +1671,25 @@ asmlinkage long sys_setrlimit(unsigned i
  * the c* fields from p->signal from races with exit.c updating those
  * fields when reaping, so a sample either gets all the additions of a
  * given child after it's reaped, or none so this sample is before reaping.
+ *
+ * tasklist_lock locking optimisation:
+ * If we are current and single threaded, we do not need to take the tasklist
+ * lock or the siglock.  No one else can take our signal_struct away, 
+ * no one else can reap the children to update signal->c* counters, and
+ * no one else can race with the signal-> fields.
+ * If we do not take the tasklist_lock, the signal-> fields could be read
+ * out of order while another thread was just exiting. So we place a 
+ * read memory barrier when we avoid the lock.  On the writer side, 
+ * write memory barrier is implied in  __exit_signal as __exit_signal releases 
+ * the siglock spinlock after updating the signal-> fields.
+ *
+ * We don't really need the siglock when we access the non c* fields
+ * of the signal_struct (for RUSAGE_SELF) even in multithreaded
+ * case, since we take the tasklist lock for read and the non c* signal-> 
+ * fields are updated only in __exit_signal, which is called with 
+ * tasklist_lock taken for write, hence these two threads cannot execute
+ * concurrently.
+ *
  */
 
 static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
@@ -1681,14 +1697,22 @@ static void k_getrusage(struct task_stru
 	struct task_struct *t;
 	unsigned long flags;
 	cputime_t utime, stime;
+	int need_lock = 0;
 
 	memset((char *) r, 0, sizeof *r);
-
-	if (unlikely(!p->signal))
-		return;
-
 	utime = stime = cputime_zero;
 
+	need_lock = !(p == current && thread_group_empty(p));
+	if (need_lock) {
+		read_lock(&tasklist_lock);
+		if (unlikely(!p->signal)) {
+			read_unlock(&tasklist_lock);
+			return;
+		}
+	} else
+		/* See locking comments above */
+		smp_rmb();
+
 	switch (who) {
 		case RUSAGE_BOTH:
 		case RUSAGE_CHILDREN:
@@ -1727,6 +1751,8 @@ static void k_getrusage(struct task_stru
 			BUG();
 	}
 
+	if (need_lock)
+		read_unlock(&tasklist_lock);
 	cputime_to_timeval(utime, &r->ru_utime);
 	cputime_to_timeval(stime, &r->ru_stime);
 }
@@ -1734,9 +1760,7 @@ static void k_getrusage(struct task_stru
 int getrusage(struct task_struct *p, int who, struct rusage __user *ru)
 {
 	struct rusage r;
-	read_lock(&tasklist_lock);
 	k_getrusage(p, who, &r);
-	read_unlock(&tasklist_lock);
 	return copy_to_user(ru, &r, sizeof(r)) ? -EFAULT : 0;
 }
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2006-01-06  9:46               ` Ravikiran G Thirumalai
@ 2006-01-06 17:23                 ` Christoph Lameter
  2006-01-06 19:46                   ` Ravikiran G Thirumalai
  2006-01-06 23:52                   ` Andrew Morton
  2006-01-08 11:49                 ` Oleg Nesterov
  1 sibling, 2 replies; 25+ messages in thread
From: Christoph Lameter @ 2006-01-06 17:23 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Oleg Nesterov, Shai Fultheim, Nippun Goel, linux-kernel, Andrew Morton

On Fri, 6 Jan 2006, Ravikiran G Thirumalai wrote:

> +	need_lock = !(p == current && thread_group_empty(p));

Isn't

need_lock = (p != current || !thread_group_empty(p))

clearer?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2006-01-06 17:23                 ` Christoph Lameter
@ 2006-01-06 19:46                   ` Ravikiran G Thirumalai
  2006-03-20 18:04                     ` Oleg Nesterov
  2006-01-06 23:52                   ` Andrew Morton
  1 sibling, 1 reply; 25+ messages in thread
From: Ravikiran G Thirumalai @ 2006-01-06 19:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Oleg Nesterov, Shai Fultheim, Nippun Goel, linux-kernel, Andrew Morton

On Fri, Jan 06, 2006 at 09:23:30AM -0800, Christoph Lameter wrote:
> On Fri, 6 Jan 2006, Ravikiran G Thirumalai wrote:
> 
> > +	need_lock = !(p == current && thread_group_empty(p));
> 
> Isn't
> 
> need_lock = (p != current || !thread_group_empty(p))
> 
> clearer?

It felt all the same to me, and the comments were bold and clear.  Also, the
above was a direct C translation of what the comment says :) ...I am OK either
way.  So here goes...

Following patch avoids taking the global tasklist_lock when possible,
if a process is single threaded during getrusage().  Any avoidance of
tasklist_lock is good for NUMA boxes (and possibly for large SMPs).
Thanks to Oleg Nesterov for review and suggestions.

Signed-off-by: Nippun Goel <nippung@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.15-delme/kernel/sys.c
===================================================================
--- linux-2.6.15-delme.orig/kernel/sys.c	2006-01-05 15:40:38.000000000 -0800
+++ linux-2.6.15-delme/kernel/sys.c	2006-01-06 00:43:08.000000000 -0800
@@ -1664,9 +1664,6 @@ asmlinkage long sys_setrlimit(unsigned i
  * a lot simpler!  (Which we're not doing right now because we're not
  * measuring them yet).
  *
- * This expects to be called with tasklist_lock read-locked or better,
- * and the siglock not locked.  It may momentarily take the siglock.
- *
  * When sampling multiple threads for RUSAGE_SELF, under SMP we might have
  * races with threads incrementing their own counters.  But since word
  * reads are atomic, we either get new values or old values and we don't
@@ -1674,6 +1671,25 @@ asmlinkage long sys_setrlimit(unsigned i
  * the c* fields from p->signal from races with exit.c updating those
  * fields when reaping, so a sample either gets all the additions of a
  * given child after it's reaped, or none so this sample is before reaping.
+ *
+ * tasklist_lock locking optimisation:
+ * If we are current and single threaded, we do not need to take the tasklist
+ * lock or the siglock.  No one else can take our signal_struct away, 
+ * no one else can reap the children to update signal->c* counters, and
+ * no one else can race with the signal-> fields.
+ * If we do not take the tasklist_lock, the signal-> fields could be read
+ * out of order while another thread was just exiting. So we place a 
+ * read memory barrier when we avoid the lock.  On the writer side, 
+ * write memory barrier is implied in  __exit_signal as __exit_signal releases 
+ * the siglock spinlock after updating the signal-> fields.
+ *
+ * We don't really need the siglock when we access the non c* fields
+ * of the signal_struct (for RUSAGE_SELF) even in multithreaded
+ * case, since we take the tasklist lock for read and the non c* signal-> 
+ * fields are updated only in __exit_signal, which is called with 
+ * tasklist_lock taken for write, hence these two threads cannot execute
+ * concurrently.
+ *
  */
 
 static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
@@ -1681,14 +1697,22 @@ static void k_getrusage(struct task_stru
 	struct task_struct *t;
 	unsigned long flags;
 	cputime_t utime, stime;
+	int need_lock = 0;
 
 	memset((char *) r, 0, sizeof *r);
-
-	if (unlikely(!p->signal))
-		return;
-
 	utime = stime = cputime_zero;
 
+	need_lock = (p != current || !thread_group_empty(p));
+	if (need_lock) {
+		read_lock(&tasklist_lock);
+		if (unlikely(!p->signal)) {
+			read_unlock(&tasklist_lock);
+			return;
+		}
+	} else
+		/* See locking comments above */
+		smp_rmb();
+
 	switch (who) {
 		case RUSAGE_BOTH:
 		case RUSAGE_CHILDREN:
@@ -1727,6 +1751,8 @@ static void k_getrusage(struct task_stru
 			BUG();
 	}
 
+	if (need_lock)
+		read_unlock(&tasklist_lock);
 	cputime_to_timeval(utime, &r->ru_utime);
 	cputime_to_timeval(stime, &r->ru_stime);
 }
@@ -1734,9 +1760,7 @@ static void k_getrusage(struct task_stru
 int getrusage(struct task_struct *p, int who, struct rusage __user *ru)
 {
 	struct rusage r;
-	read_lock(&tasklist_lock);
 	k_getrusage(p, who, &r);
-	read_unlock(&tasklist_lock);
 	return copy_to_user(ru, &r, sizeof(r)) ? -EFAULT : 0;
 }
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2006-01-06 17:23                 ` Christoph Lameter
  2006-01-06 19:46                   ` Ravikiran G Thirumalai
@ 2006-01-06 23:52                   ` Andrew Morton
  1 sibling, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2006-01-06 23:52 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: kiran, oleg, shai, nippung, linux-kernel

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> On Fri, 6 Jan 2006, Ravikiran G Thirumalai wrote:
> 
> > +	need_lock = !(p == current && thread_group_empty(p));
> 
> Isn't
> 
> need_lock = (p != current || !thread_group_empty(p))
> 
> clearer?

I was actually going to change it to

	if (p != current || !thread_group_empty(p))
		need_lock = 1;

a) because my brain works that way and 

b) To make the currently-unneeded initialisation of need_lock do
   something useful ;)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single  threadedprocess at getrusage()
  2006-01-06  9:46               ` Ravikiran G Thirumalai
  2006-01-06 17:23                 ` Christoph Lameter
@ 2006-01-08 11:49                 ` Oleg Nesterov
  2006-01-08 19:58                   ` Ravikiran G Thirumalai
  1 sibling, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2006-01-08 11:49 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

Sorry for delay,

Ravikiran G Thirumalai wrote:
> 
>  static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
> @@ -1681,14 +1697,22 @@ static void k_getrusage(struct task_stru
>         struct task_struct *t;
>         unsigned long flags;
>         cputime_t utime, stime;
> +       int need_lock = 0;

Unneeded initialization

>         memset((char *) r, 0, sizeof *r);
> -
> -       if (unlikely(!p->signal))
> -               return;
> -
>         utime = stime = cputime_zero;
> 
> +       need_lock = !(p == current && thread_group_empty(p));
> +       if (need_lock) {
> +               read_lock(&tasklist_lock);
> +               if (unlikely(!p->signal)) {
> +                       read_unlock(&tasklist_lock);
> +                       return;
> +               }
> +       } else
> +               /* See locking comments above */
> +               smp_rmb();

This patch doesn't try to optimize ->sighand.siglock locking,
and I think this is right. But this also means we don't need
the rmb() here. It was needed to protect against the "another
thread just exited, the cpu can read ->c* values before
thread_group_empty() without taking the siglock" case; now that
is not possible.

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2006-01-08 11:49                 ` Oleg Nesterov
@ 2006-01-08 19:58                   ` Ravikiran G Thirumalai
  2006-01-09 18:55                     ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Ravikiran G Thirumalai @ 2006-01-08 19:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

On Sun, Jan 08, 2006 at 02:49:31PM +0300, Oleg Nesterov wrote:
> Sorry for delay,
> 
> Ravikiran G Thirumalai wrote:
> > 
> >  static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
> > @@ -1681,14 +1697,22 @@ static void k_getrusage(struct task_stru
> >         struct task_struct *t;
> >         unsigned long flags;
> >         cputime_t utime, stime;
> > +       int need_lock = 0;
> 
> Unneeded initialization

akpm changed the conditional assignment below to an if test, so the
initialization is needed now.

> 
> >         memset((char *) r, 0, sizeof *r);
> > -
> > -       if (unlikely(!p->signal))
> > -               return;
> > -
> >         utime = stime = cputime_zero;
> > 
> > +       need_lock = !(p == current && thread_group_empty(p));
> > +       if (need_lock) {
> > +               read_lock(&tasklist_lock);
> > +               if (unlikely(!p->signal)) {
> > +                       read_unlock(&tasklist_lock);
> > +                       return;
> > +               }
> > +       } else
> > +               /* See locking comments above */
> > +               smp_rmb();
> 
> This patch doesn't try to optimize ->sighand.siglock locking,
> and I think this is right. But this also means we don't need
> rmb() here. It was needed to protect against "another thread
> just exited, cpu can read ->c* values before thread_group_empty()
> without taking siglock" case, now it is not possible.

Don't we still need the rmb for the RUSAGE_SELF case?  We do not take the
siglock for RUSAGE_SELF, and the non-c* signal fields are written to
in __exit_signal...

What is wrong with optimizing by not taking the siglock in RUSAGE_BOTH
and RUSAGE_CHILDREN?  I would like to add that in too unless  I am
missing something and the optimization is incorrect.

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single  threadedprocess at getrusage()
  2006-01-08 19:58                   ` Ravikiran G Thirumalai
@ 2006-01-09 18:55                     ` Oleg Nesterov
  2006-01-09 20:54                       ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2006-01-09 18:55 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

Ravikiran G Thirumalai wrote:
>
> On Sun, Jan 08, 2006 at 02:49:31PM +0300, Oleg Nesterov wrote:
> > > +       } else
> > > +               /* See locking comments above */
> > > +               smp_rmb();
> >
> > This patch doesn't try to optimize ->sighand.siglock locking,
> > and I think this is right. But this also means we don't need
> > rmb() here. It was needed to protect against "another thread
> > just exited, cpu can read ->c* values before thread_group_empty()
> > without taking siglock" case, now it is not possible.
>
> Don't we still need rmb for the RUSAGE_SELF case? we do not take the
> siglock for rusage self and the non c* signal fields are written to
> at __exit_signal...

I think it is unneeded because the RUSAGE_SELF case is "racy" anyway:
even if we held both locks, the task_struct->xxx counters can change
at any moment.

But maybe you are right.

> What is wrong with optimizing by not taking the siglock in RUSAGE_BOTH
> and RUSAGE_CHILDREN?  I would like to add that in too unless  I am
> missing something and the optimization is incorrect.

We can't have contention on ->siglock when need_lock == 0, so why should
we optimize this case?

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2006-01-09 18:55                     ` Oleg Nesterov
@ 2006-01-09 20:54                       ` Ravikiran G Thirumalai
  2006-01-10 19:03                         ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Ravikiran G Thirumalai @ 2006-01-09 20:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

On Mon, Jan 09, 2006 at 09:55:51PM +0300, Oleg Nesterov wrote:
> Ravikiran G Thirumalai wrote:
> >
> >
> > Don't we still need rmb for the RUSAGE_SELF case? we do not take the
> > siglock for rusage self and the non c* signal fields are written to
> > at __exit_signal...
> 
> I think it is unneeded because RUSAGE_SELF case is "racy" anyway even
> if we held both locks, task_struct->xxx counters can change at any
> moment.
> 
> But may be you are right.

Hmm...access to task_struct->xxx has always been racy, but accessing the
signal->* counters was not.  What if the read of signal->utime was a
hoisted read and signal->stime was read after the counter was updated?
That was not a possibility earlier, no?
 
> 
> > What is wrong with optimizing by not taking the siglock in RUSAGE_BOTH
> > and RUSAGE_CHILDREN?  I would like to add that in too unless  I am
> > missing something and the optimization is incorrect.
> 
> We can't have contention on ->siglock when need_lock == 0, so why should
> we optimize this case?

We would be saving one bus-locked operation in that case (on some arches), a
cacheline fetch in exclusive state (since signal and sighand are at different
memory locations), and the disabling/enabling of on-chip interrupts.  But yes,
this would be a smaller optimization....Unless you have strong objections, can
this also go in?

Thanks,
Kiran

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>

Index: linux-2.6.15-mm2/kernel/sys.c
===================================================================
--- linux-2.6.15-mm2.orig/kernel/sys.c	2006-01-09 12:44:30.000000000 -0800
+++ linux-2.6.15-mm2/kernel/sys.c	2006-01-09 12:45:07.000000000 -0800
@@ -1730,14 +1730,16 @@ static void k_getrusage(struct task_stru
 	switch (who) {
 		case RUSAGE_BOTH:
 		case RUSAGE_CHILDREN:
-			spin_lock_irqsave(&p->sighand->siglock, flags);
+			if (need_lock)
+				spin_lock_irqsave(&p->sighand->siglock, flags);
 			utime = p->signal->cutime;
 			stime = p->signal->cstime;
 			r->ru_nvcsw = p->signal->cnvcsw;
 			r->ru_nivcsw = p->signal->cnivcsw;
 			r->ru_minflt = p->signal->cmin_flt;
 			r->ru_majflt = p->signal->cmaj_flt;
-			spin_unlock_irqrestore(&p->sighand->siglock, flags);
+			if (need_lock)
+				spin_unlock_irqrestore(&p->sighand->siglock, flags);
 
 			if (who == RUSAGE_CHILDREN)
 				break;


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single  threadedprocess at getrusage()
  2006-01-09 20:54                       ` Ravikiran G Thirumalai
@ 2006-01-10 19:03                         ` Oleg Nesterov
  2006-01-16 20:56                           ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2006-01-10 19:03 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

Ravikiran G Thirumalai wrote:
> 
> On Mon, Jan 09, 2006 at 09:55:51PM +0300, Oleg Nesterov wrote:
> > Ravikiran G Thirumalai wrote:
> > >
> > >
> > > Don't we still need rmb for the RUSAGE_SELF case? we do not take the
> > > siglock for rusage self and the non c* signal fields are written to
> > > at __exit_signal...
> >
> > I think it is unneeded because RUSAGE_SELF case is "racy" anyway even
> > if we held both locks, task_struct->xxx counters can change at any
> > moment.
> >
> > But may be you are right.
> 
> Hmm...access to task_struct->xxx has been racy, but accessing the
> signal->* counters were not.  What if read of the signal->utime  was  a
> hoisted read and signal->stime was a read after the counter is updated?
> This was not a possibility earlier no?

Sorry, I can't understand. Could you please be more verbose?

> >
> > > What is wrong with optimizing by not taking the siglock in RUSAGE_BOTH
> > > and RUSAGE_CHILDREN?  I would like to add that in too unless  I am
> > > missing something and the optimization is incorrect.
> >
> > We can't have contention on ->siglock when need_lock == 0, so why should
> > we optimize this case?
> 
> We would be saving 1 buslocked operation in that case (on some arches), a
> cacheline fetch for exclusive (since signal and sighand are on different memory
> locations), and disabling/enabling onchip interrupts.  But yes, this would be a
> smaller optimization....Unless you have strong objections this can also
> go in?

I don't have strong objections, but I am not a maintainer.

However, do you have any numbers, or thoughts on why this optimization
would make any _visible_ effect?

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2006-01-10 19:03                         ` Oleg Nesterov
@ 2006-01-16 20:56                           ` Ravikiran G Thirumalai
  2006-01-17 19:59                             ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Ravikiran G Thirumalai @ 2006-01-16 20:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton, dipankar

Sorry for the delay..

On Tue, Jan 10, 2006 at 10:03:35PM +0300, Oleg Nesterov wrote:
> Ravikiran G Thirumalai wrote:
> > 
> > > >
> > > > Don't we still need rmb for the RUSAGE_SELF case? we do not take the
> > > > siglock for rusage self and the non c* signal fields are written to
> > > > at __exit_signal...
> > >
> > > I think it is unneeded because RUSAGE_SELF case is "racy" anyway even
> > > if we held both locks, task_struct->xxx counters can change at any
> > > moment.
> > >
> > > But may be you are right.
> > 
> > Hmm...access to task_struct->xxx has been racy, but accessing the
> > signal->* counters were not.  What if read of the signal->utime  was  a
> > hoisted read and signal->stime was a read after the counter is updated?
> > This was not a possibility earlier no?
> 
> Sorry, I can't undestand. Could you please be more verbose ?

What I meant to say was: if a thread has just exited, then since we no longer
take locks in the single-threaded case, the read of signal->utime may happen
out of order,
(excuse long lines)

Last thread (RUSAGE_SELF)		Exiting thread


k_getrusage()				__exit_signal()
	.					.
	load sig->utime (hoisted read) 		.
	.					sig->utime = cputime_add(sig->utime, tsk->utime);
	.					sig->stime = cputime_add(sig->stime, tsk->stime);
						.
						.
						spin_unlock(&sighand->siglock); --> (A)
						.
					__unhash_process()
						.
						detach_pid(p, PIDTYPE_PGID);
	if (!thread_group_empty())		.
	.
	don't take any lock based on if --> (B)
	.
	.
	utime = cputime_add(utime, p->signal->utime); /* use cached load above */
	stime = cputime_add(stime, p->signal->stime); /* load from memory */

So although the writes happen in order due to (A) above, there is no guarantee
in terms of read order when we do not take locks (as far as my understanding
goes), so I think an rmb() is needed at (B); else, as in this example, some
counters may have values from before the exiting thread updated the sig->
fields and some from after.  This might have a more significant effect than
the task_struct->xxx inaccuracies.  Of course this is theoretical.  This was
not a possibility earlier, because __exit_signal and k_getrusage() could not
run at the same time: the exiting thread takes the tasklist lock for write
and the k_getrusage thread takes it for read.
I am also cc'ing experts in memory re-ordering issues to check if I am
missing something :)

I think we need an rmb() at sys_times too, based on the above.  I will make a
patch for that.


> 
> > >
> > > > What is wrong with optimizing by not taking the siglock in RUSAGE_BOTH
> > > > and RUSAGE_CHILDREN?  I would like to add that in too unless  I am
> > > > missing something and the optimization is incorrect.
> > >
> > > We can't have contention on ->siglock when need_lock == 0, so why should
> > > we optimize this case?
> > 
> > We would be saving 1 buslocked operation in that case (on some arches), a
> > cacheline fetch for exclusive (since signal and sighand are on different memory
> > locations), and disabling/enabling onchip interrupts.  But yes, this would be a
> > smaller optimization....Unless you have strong objections this can also
> > go in?
> 
> I don't have strong objections, but I am not a maintainer.
> 
> However, do you have any numbers or thoughts why this optimization
> can make any _visible_ effect?

We know we don't need the locks there, so I do not understand why we
should keep them.  Locks are always serializing, expensive operations.  I
believe that on some arches disabling on-chip interrupts is also an expensive
operation...some arches might use hypervisor calls to do that, which I guess
have their own overhead...so why keep it when we know we don't need it?

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2006-01-17 19:59                             ` Oleg Nesterov
@ 2006-01-17 19:52                               ` Ravikiran G Thirumalai
  2006-01-18  9:17                                 ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Ravikiran G Thirumalai @ 2006-01-17 19:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton, dipankar

On Tue, Jan 17, 2006 at 10:59:02PM +0300, Oleg Nesterov wrote:
> Ravikiran G Thirumalai wrote:
> > 
> > Sorry for the delay..
> > 
> > On Tue, Jan 10, 2006 at 10:03:35PM +0300, Oleg Nesterov wrote:
> > >
> > Sorry, I can't understand. Could you please be more verbose?
> > 
> > Last thread (RUSAGE_SELF)               Exiting thread
> >
> > [ ... ]
> >
> >         utime = cputime_add(utime, p->signal->utime); /* use cached load above */
> >         stime = cputime_add(stime, p->signal->stime); /* load from memory */
> 
> Thanks for your explanation, now I see what you mean.
> 
> But don't we already discussed this issue? I think that RUSAGE_SELF
> case always not 100% accurate, so it is Ok to ignore this race.

It is not 100% accurate in that we lose time accounting of one clock tick
for the task_struct->utime/stime counters.  But
task_struct->signal->utime/stime collect the rusage times of an exiting
thread, so we would be introducing large inaccuracies if we don't use the
rmb here.  Take the case where an exiting thread has large utime/stime
values, and rusage reports utime from before the thread exit and stime from
after... the result would look weird.
So IMHO, while inaccuracies in the task_struct->xxx times can be tolerated,
it might not be such a good idea for the task_struct->signal->xxx counters.

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [rfc][patch] Avoid taking global tasklist_lock for single  threadedprocess at getrusage()
  2006-01-16 20:56                           ` Ravikiran G Thirumalai
@ 2006-01-17 19:59                             ` Oleg Nesterov
  2006-01-17 19:52                               ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2006-01-17 19:59 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton, dipankar

Ravikiran G Thirumalai wrote:
> 
> Sorry for the delay..
> 
> On Tue, Jan 10, 2006 at 10:03:35PM +0300, Oleg Nesterov wrote:
> >
> > Sorry, I can't understand. Could you please be more verbose?
> 
> Last thread (RUSAGE_SELF)               Exiting thread
>
> [ ... ]
>
>         utime = cputime_add(utime, p->signal->utime); /* use cached load above */
>         stime = cputime_add(stime, p->signal->stime); /* load from memory */

Thanks for your explanation, now I see what you mean.

But haven't we already discussed this issue? I think that the RUSAGE_SELF
case is never 100% accurate anyway, so it is OK to ignore this race.

What if that thread has not exited yet? We take the tasklist lock, but
this can't help, because that thread may be updating its ->xtime
right now on another CPU, and we have exactly the same problem.

> > However, do you have any numbers, or thoughts on why this optimization
> > would have any _visible_ effect?
> 
> We know we don't need locks there, so I do not understand why we
> should keep them.  Locks are always serializing and expensive operations.
> I believe that on some arches disabling on-chip interrupts is also
> expensive... some arches might even use hypervisor calls for that, which I
> guess has its own overhead... so why have it when we know we don't need it?

I think it is better not to complicate the code unless we can see
some difference in practice.

That said, I don't have a strong feeling that I am right (on both
issues), so please feel free to ignore me.

Oleg.

* Re: [rfc][patch] Avoid taking global tasklist_lock for single  threadedprocess at getrusage()
  2006-01-17 19:52                               ` Ravikiran G Thirumalai
@ 2006-01-18  9:17                                 ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2006-01-18  9:17 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton, dipankar

Ravikiran G Thirumalai wrote:
> 
> On Tue, Jan 17, 2006 at 10:59:02PM +0300, Oleg Nesterov wrote:
> >
> > But haven't we already discussed this issue? I think that the RUSAGE_SELF
> > case is never 100% accurate anyway, so it is OK to ignore this race.
> 
> Take the case where an exiting thread has large utime and stime values, and
> rusage reports utime from before the thread exit but stime from after it...
> the result would look weird.
> So IMHO, while inaccuracies in the task_struct->xxx times can be tolerated,
> the same is not such a good idea for the task_struct->signal->xxx counters.

Yes, you are right. Now I don't understand why I didn't understand this
before. Thank you, Ravikiran.

Oleg.

* Re: [rfc][patch] Avoid taking global tasklist_lock for single  threadedprocess at getrusage()
  2006-01-06 19:46                   ` Ravikiran G Thirumalai
@ 2006-03-20 18:04                     ` Oleg Nesterov
  2006-03-22 22:18                       ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2006-03-20 18:04 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

Hello Ravikiran,

Ravikiran G Thirumalai wrote:
> 
> Following patch avoids taking the global tasklist_lock when possible,
> if a process is single threaded during getrusage().  Any avoidance of
> tasklist_lock is good for NUMA boxes (and possibly for large SMPs).
>
> ...
>
>  static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
> @@ -1681,14 +1697,22 @@ static void k_getrusage(struct task_stru
>         struct task_struct *t;
>         unsigned long flags;
>         cputime_t utime, stime;
> +       int need_lock = 0;
> 
>         memset((char *) r, 0, sizeof *r);
> -
> -       if (unlikely(!p->signal))
> -               return;
> -
>         utime = stime = cputime_zero;
> 
> +       need_lock = (p != current || !thread_group_empty(p));
> +       if (need_lock) {
> +               read_lock(&tasklist_lock);
> +               if (unlikely(!p->signal)) {
> +                       read_unlock(&tasklist_lock);
> +                       return;
> +               }
> +       } else
> +               /* See locking comments above */
> +               smp_rmb();
> +

I think now it is possible to improve this patch.

Could you look at these patches?

	[PATCH] introduce lock_task_sighand() helper
	http://marc.theaimsgroup.com/?l=linux-kernel&m=114028190927763

	[PATCH 0/3] make threads traversal ->siglock safe
	http://marc.theaimsgroup.com/?l=linux-kernel&m=114064825626496

I think we can forget about tasklist_lock in k_getrusage() completely
and just use lock_task_sighand().
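Roughly, the idea of lock_task_sighand() — a sketch based on the patch
linked above, not necessarily its final form — is to retry until we either
pin a valid sighand under its own lock or observe that the task has none
left:

```
static struct sighand_struct *lock_task_sighand(struct task_struct *tsk,
						unsigned long *flags)
{
	struct sighand_struct *sighand;

	for (;;) {
		sighand = rcu_dereference(tsk->sighand);
		if (sighand == NULL)
			break;		/* task already passed release_task() */
		spin_lock_irqsave(&sighand->siglock, *flags);
		if (likely(sighand == tsk->sighand))
			break;		/* still the task's sighand: success */
		/* ->sighand changed under us (exec or release), retry */
		spin_unlock_irqrestore(&sighand->siglock, *flags);
	}
	return sighand;
}
```

The caller holds rcu_read_lock(), which — together with SLAB_DESTROY_BY_RCU
on the sighand cache — keeps the memory of a concurrently freed
sighand_struct valid while we take and test its lock.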

What do you think?

Oleg.

* Re: [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
  2006-03-20 18:04                     ` Oleg Nesterov
@ 2006-03-22 22:18                       ` Ravikiran G Thirumalai
  2006-03-23 18:18                         ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Ravikiran G Thirumalai @ 2006-03-22 22:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

On Mon, Mar 20, 2006 at 09:04:56PM +0300, Oleg Nesterov wrote:
> Hello Ravikiran,

Hi Oleg, sorry for the late response..

> 
> Ravikiran G Thirumalai wrote:
> > 
> > Following patch avoids taking the global tasklist_lock when possible,
> > if a process is single threaded during getrusage().  Any avoidance of
> > tasklist_lock is good for NUMA boxes (and possibly for large SMPs).
> >
> > ...
> >
> >  static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
> > @@ -1681,14 +1697,22 @@ static void k_getrusage(struct task_stru
> >         struct task_struct *t;
> >         unsigned long flags;
> >         cputime_t utime, stime;
> > +       int need_lock = 0;
> > 
> >         memset((char *) r, 0, sizeof *r);
> > -
> > -       if (unlikely(!p->signal))
> > -               return;
> > -
> >         utime = stime = cputime_zero;
> > 
> > +       need_lock = (p != current || !thread_group_empty(p));
> > +       if (need_lock) {
> > +               read_lock(&tasklist_lock);
> > +               if (unlikely(!p->signal)) {
> > +                       read_unlock(&tasklist_lock);
> > +                       return;
> > +               }
> > +       } else
> > +               /* See locking comments above */
> > +               smp_rmb();
> > +
> 
> I think now it is possible to improve this patch.
> 
> Could you look at these patches?
> 
> 	[PATCH] introduce lock_task_sighand() helper
> 	http://marc.theaimsgroup.com/?l=linux-kernel&m=114028190927763
> 
> 	[PATCH 0/3] make threads traversal ->siglock safe
> 	http://marc.theaimsgroup.com/?l=linux-kernel&m=114064825626496
> 
> I think we can forget about tasklist_lock in k_getrusage() completely
> and just use lock_task_sighand().
> 
> What do you think?

Great!! Nice patches to avoid the tasklist lock on thread traversal.

However, I was trying to comprehend the tasklist locking changes in 
2.6.16-rc6mm2 and hit upon:

__exit_signal
	cleanup_sighand(tsk);
		kmem_cache_free(sighand)
	spin_unlock(sighand->lock)

It looked suspicious to me until I realised the sighand cache now has
SLAB_DESTROY_BY_RCU. Can we please add comments (at cleanup_sighand or
__exit_signal) to make it a bit clearer for people like me :)
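(For other readers who trip over the same thing, here is the ordering as I
understand it — a timeline, not code:)

```
CPU 0: __exit_signal()                 CPU 1: concurrent reader
-------------------------------        -------------------------------------
spin_lock(&sighand->siglock)           rcu_read_lock()
tsk->sighand = NULL                    sh = rcu_dereference(tsk->sighand)
cleanup_sighand(tsk)                     /* may still see the old pointer */
  kmem_cache_free(..., sighand)        spin_lock(&sh->siglock)
spin_unlock(&sighand->siglock)           /* memory still valid: with
                                            SLAB_DESTROY_BY_RCU the slab page
                                            cannot return to the page
                                            allocator until CPU 1 leaves its
                                            RCU read-side section, and the
                                            object can only be reused as
                                            another sighand_struct */
                                       recheck sh == tsk->sighand, else retry
```

As I read it, this is also why CPU 0's spin_unlock after the kmem_cache_free
is safe: the lock word is still a valid, once-initialized spinlock inside a
sighand slab, so any new user of the object simply waits on it.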

How about the following patch, which avoids the tasklist lock completely at getrusage?
(Andrew, I can remake the patch against a reverted 
avoid-taking-global-tasklist_lock-for-single-threadedprocess-at-getrusage
if you prefer it that way)

Thanks,
Kiran


Change the avoid-taking-global-tasklist_lock-for-single-threadedprocess-at-getrusage
patch to not take the global tasklist lock at all. We no longer need the
tasklist lock for thread traversal of a process, thanks to Oleg's
do-__unhash_process-under-siglock.patch and related work.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>

Index: linux-2.6.16-rc6mm2/kernel/sys.c
===================================================================
--- linux-2.6.16-rc6mm2.orig/kernel/sys.c	2006-03-21 17:04:09.000000000 -0800
+++ linux-2.6.16-rc6mm2/kernel/sys.c	2006-03-22 12:46:03.000000000 -0800
@@ -1860,23 +1860,20 @@ out:
  * fields when reaping, so a sample either gets all the additions of a
  * given child after it's reaped, or none so this sample is before reaping.
  *
- * tasklist_lock locking optimisation:
- * If we are current and single threaded, we do not need to take the tasklist
- * lock or the siglock.  No one else can take our signal_struct away,
- * no one else can reap the children to update signal->c* counters, and
- * no one else can race with the signal-> fields.
- * If we do not take the tasklist_lock, the signal-> fields could be read
- * out of order while another thread was just exiting. So we place a
- * read memory barrier when we avoid the lock.  On the writer side,
- * write memory barrier is implied in  __exit_signal as __exit_signal releases
- * the siglock spinlock after updating the signal-> fields.
- *
- * We don't really need the siglock when we access the non c* fields
- * of the signal_struct (for RUSAGE_SELF) even in multithreaded
- * case, since we take the tasklist lock for read and the non c* signal->
- * fields are updated only in __exit_signal, which is called with
- * tasklist_lock taken for write, hence these two threads cannot execute
- * concurrently.
+ * Locking:
+ * We need to take the siglock for CHILDREN, SELF and BOTH
+ * for the cases current multithreaded, non-current single threaded and
+ * non-current multithreaded.  Thread traversal is now safe with
+ * the siglock held.
+ * Strictly speaking, we do not need to take the siglock if we are current and
+ * single threaded, as no one else can take our signal_struct away, no one
+ * else can reap the children to update signal->c* counters, and no one else
+ * can race with the signal-> fields. If we do not take any lock, the
+ * signal-> fields could be read out of order while another thread was just
+ * exiting. So we should place a read memory barrier when we avoid the lock.
+ * On the writer side, a write memory barrier is implied in __exit_signal,
+ * as __exit_signal releases the siglock spinlock after updating the signal->
+ * fields. But we don't do this yet, to keep things simple.
  *
  */
 
@@ -1885,35 +1882,25 @@ static void k_getrusage(struct task_stru
 	struct task_struct *t;
 	unsigned long flags;
 	cputime_t utime, stime;
-	int need_lock = 0;
 
 	memset((char *) r, 0, sizeof *r);
 	utime = stime = cputime_zero;
 
-	if (p != current || !thread_group_empty(p))
-		need_lock = 1;
-
-	if (need_lock) {
-		read_lock(&tasklist_lock);
-		if (unlikely(!p->signal)) {
-			read_unlock(&tasklist_lock);
-			return;
-		}
-	} else
-		/* See locking comments above */
-		smp_rmb();
+	rcu_read_lock();
+	if (!lock_task_sighand(p, &flags)) {
+		rcu_read_unlock();
+		return;
+	}
 
 	switch (who) {
 		case RUSAGE_BOTH:
 		case RUSAGE_CHILDREN:
-			spin_lock_irqsave(&p->sighand->siglock, flags);
 			utime = p->signal->cutime;
 			stime = p->signal->cstime;
 			r->ru_nvcsw = p->signal->cnvcsw;
 			r->ru_nivcsw = p->signal->cnivcsw;
 			r->ru_minflt = p->signal->cmin_flt;
 			r->ru_majflt = p->signal->cmaj_flt;
-			spin_unlock_irqrestore(&p->sighand->siglock, flags);
 
 			if (who == RUSAGE_CHILDREN)
 				break;
@@ -1940,9 +1927,10 @@ static void k_getrusage(struct task_stru
 		default:
 			BUG();
 	}
-
-	if (need_lock)
-		read_unlock(&tasklist_lock);
+
+	unlock_task_sighand(p, &flags);
+	rcu_read_unlock();
+
 	cputime_to_timeval(utime, &r->ru_utime);
 	cputime_to_timeval(stime, &r->ru_stime);
 }


* Re: [rfc][patch] Avoid taking global tasklist_lock for single  threadedprocess at getrusage()
  2006-03-22 22:18                       ` Ravikiran G Thirumalai
@ 2006-03-23 18:18                         ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2006-03-23 18:18 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Christoph Lameter, Shai Fultheim, Nippun Goel, linux-kernel,
	Andrew Morton

Ravikiran G Thirumalai wrote:
>
> __exit_signal
>         cleanup_sighand(tsk);
>                 kmem_cache_free(sighand)
>         spin_unlock(sighand->lock)
> 
> It looked suspicious to me until I realised sighand cache now had
> SLAB_DESTROY_BY_RCU. Can we please add comments (at cleanup_sighand or
> __exit_signal) to make it a bit clearer for people like me :)

Yes, this is really confusing, see also
	http://marc.theaimsgroup.com/?l=linux-kernel&m=114089590923893
I'll send a cleanup patch soon.

> How about the following patch, which avoids the tasklist lock completely at getrusage?

I think this patch is correct, thanks.

Oleg.

end of thread, other threads:[~2006-03-23 18:21 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-12-24 17:52 [rfc][patch] Avoid taking global tasklist_lock for single threaded process at getrusage() Oleg Nesterov
2005-12-27 20:21 ` Christoph Lameter
2005-12-28 12:38   ` [rfc][patch] Avoid taking global tasklist_lock for single threadedprocess " Oleg Nesterov
2005-12-28 18:33     ` Ravikiran G Thirumalai
2005-12-28 22:57       ` Ravikiran G Thirumalai
2005-12-30 17:57         ` Oleg Nesterov
2006-01-04 23:16           ` Ravikiran G Thirumalai
2006-01-05 19:17             ` Oleg Nesterov
2006-01-06  9:46               ` Ravikiran G Thirumalai
2006-01-06 17:23                 ` Christoph Lameter
2006-01-06 19:46                   ` Ravikiran G Thirumalai
2006-03-20 18:04                     ` Oleg Nesterov
2006-03-22 22:18                       ` Ravikiran G Thirumalai
2006-03-23 18:18                         ` Oleg Nesterov
2006-01-06 23:52                   ` Andrew Morton
2006-01-08 11:49                 ` Oleg Nesterov
2006-01-08 19:58                   ` Ravikiran G Thirumalai
2006-01-09 18:55                     ` Oleg Nesterov
2006-01-09 20:54                       ` Ravikiran G Thirumalai
2006-01-10 19:03                         ` Oleg Nesterov
2006-01-16 20:56                           ` Ravikiran G Thirumalai
2006-01-17 19:59                             ` Oleg Nesterov
2006-01-17 19:52                               ` Ravikiran G Thirumalai
2006-01-18  9:17                                 ` Oleg Nesterov
2006-01-03 18:18         ` Christoph Lameter
