linux-kernel.vger.kernel.org archive mirror
* AIM7 40% regression with 2.6.26-rc1
@ 2008-05-06  5:48 Zhang, Yanmin
  2008-05-06 11:18 ` Matthew Wilcox
  2008-05-06 11:44 ` Ingo Molnar
  0 siblings, 2 replies; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-06  5:48 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: LKML

Comparing with kernel 2.6.25, AIM7 (using tmpfs) has a more than 40% regression with 2.6.26-rc1
on my 8-core Stoakley, 16-core Tigerton, and Itanium Montecito. Bisect located
the patch below.

64ac24e738823161693bf791f87adc802cf529ff is first bad commit
commit 64ac24e738823161693bf791f87adc802cf529ff
Author: Matthew Wilcox <matthew@wil.cx>
Date:   Fri Mar 7 21:55:58 2008 -0500

    Generic semaphore implementation


After I manually reverted the patch against 2.6.26-rc1 while fixing lots of
conflicts/errors, the aim7 regression became less than 2%.
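
For context, the commit in question replaced the old per-architecture semaphore
code with a single C implementation in kernel/semaphore.c: a spinlock-protected
count plus a FIFO wait list. A simplified sketch of the data structure and the
down() entry point (from memory, not the exact -rc1 source) looks like this:

#include <linux/spinlock.h>
#include <linux/list.h>

struct semaphore {
	spinlock_t		lock;		/* protects count and wait_list */
	unsigned int		count;		/* remaining acquisitions allowed */
	struct list_head	wait_list;	/* FIFO of sleeping waiters */
};

/* Contended slow path, defined in kernel/semaphore.c: queue and sleep. */
extern void __down(struct semaphore *sem);

void down(struct semaphore *sem)
{
	unsigned long flags;

	spin_lock_irqsave(&sem->lock, flags);
	if (likely(sem->count > 0))
		sem->count--;			/* uncontended fast path */
	else
		__down(sem);			/* sleep until up() hands it over */
	spin_unlock_irqrestore(&sem->lock, flags);
}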

-yanmin



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06  5:48 AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
@ 2008-05-06 11:18 ` Matthew Wilcox
  2008-05-06 11:44 ` Ingo Molnar
  1 sibling, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-06 11:18 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: LKML

On Tue, May 06, 2008 at 01:48:24PM +0800, Zhang, Yanmin wrote:
> Comparing with kernel 2.6.25, AIM7 (using tmpfs) has a more than 40% regression with 2.6.26-rc1
> on my 8-core Stoakley, 16-core Tigerton, and Itanium Montecito. Bisect located
> the patch below.
> 
> 64ac24e738823161693bf791f87adc802cf529ff is first bad commit
> commit 64ac24e738823161693bf791f87adc802cf529ff
> Author: Matthew Wilcox <matthew@wil.cx>
> Date:   Fri Mar 7 21:55:58 2008 -0500
> 
>     Generic semaphore implementation
> 
> 
> After I manually reverted the patch against 2.6.26-rc1 while fixing lots of
> conflicts/errors, the aim7 regression became less than 2%.

40%?!  That's shocking.  Can you tell which semaphore was heavily
contended?  I have a horrible feeling that it's the BKL.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06  5:48 AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
  2008-05-06 11:18 ` Matthew Wilcox
@ 2008-05-06 11:44 ` Ingo Molnar
  2008-05-06 12:09   ` Matthew Wilcox
  2008-05-07  2:11   ` Zhang, Yanmin
  1 sibling, 2 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-06 11:44 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton


* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> Comparing with kernel 2.6.25, AIM7 (using tmpfs) has a more than 40% 
> regression with 2.6.26-rc1 on my 8-core Stoakley, 16-core Tigerton, and 
> Itanium Montecito. Bisect located the patch below.
> 
> 64ac24e738823161693bf791f87adc802cf529ff is first bad commit
> commit 64ac24e738823161693bf791f87adc802cf529ff
> Author: Matthew Wilcox <matthew@wil.cx>
> Date:   Fri Mar 7 21:55:58 2008 -0500
> 
>     Generic semaphore implementation
> 
> After I manually reverted the patch against 2.6.26-rc1 while fixing 
> lots of conflicts/errors, the aim7 regression became less than 2%.

hm, which exact semaphore would that be due to?

My first blind guess would be the BKL - there's not much other semaphore 
use left in the core kernel otherwise that would affect AIM7 normally. 
The VFS still makes frequent use of the BKL and AIM7 is very VFS 
intense. Getting rid of that BKL use from the VFS might be useful to 
performance anyway.

Could you try to check that it's indeed the BKL?

Easiest way to check it would be to run AIM7 on 
sched-devel.git/latest and do scheduler tracing via:

   http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt

by doing:

   echo stacktrace > /debug/tracing/iter_ctl

you could get exact backtraces of all scheduling points in the trace. If 
the BKL's down() shows up in those traces then it's definitely the BKL 
that causes this. The backtraces will also tell us exactly which BKL use 
is the most frequent one.

To keep tracing overhead low on SMP i'd also suggest tracing only a 
single CPU, via:

  echo 1 > /debug/tracing/tracing_cpumask

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 11:44 ` Ingo Molnar
@ 2008-05-06 12:09   ` Matthew Wilcox
  2008-05-06 16:23     ` Matthew Wilcox
  2008-05-07  2:11   ` Zhang, Yanmin
  1 sibling, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-06 12:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds,
	Andrew Morton, linux-fsdevel

On Tue, May 06, 2008 at 01:44:49PM +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> > After I manually reverted the patch against 2.6.26-rc1 while fixing 
> > lots of conflicts/errors, the aim7 regression became less than 2%.
> 
> hm, which exact semaphore would that be due to?
> 
> My first blind guess would be the BKL - there's not much other semaphore 
> use left in the core kernel otherwise that would affect AIM7 normally. 
> The VFS still makes frequent use of the BKL and AIM7 is very VFS 
> intense. Getting rid of that BKL use from the VFS might be useful to 
> performance anyway.

That's slightly slanderous to the VFS ;-)  The BKL really isn't used
that much any more.  So little that I've gone through and produced a
list of places it's used:

fs/block_dev.c		opening and closing a block device.  Unlikely to be
			provoked by AIM7.
fs/char_dev.c		chrdev_open.  Unlikely to be provoked by AIM7.
fs/compat.c		mount.  Unlikely to be provoked by AIM7.
fs/compat_ioctl.c	held around calls to ioctl translator.
fs/exec.c		coredump.  If this is a contention problem ...
fs/fcntl.c		held around call to ->fasync.
fs/ioctl.c		held around f_op ->ioctl call (tmpfs doesn't have
			ioctl).  ditto bmap.  there's fasync, as previously
			mentioned.
fs/locks.c		hellhole.  I hope AIM7 doesn't use locks.
fs/namespace.c		mount, umount.  Unlikely to be provoked by AIM7.
fs/read_write.c		llseek.  tmpfs uses the unlocked version.
fs/super.c		shutdown, remount.  Unlikely to be provoked by AIM7.

So the only likely things I can see are:

 - file locks
 - fasync
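
As one concrete example of the pattern the entries above refer to, fs/ioctl.c
at the time looked roughly like this (a simplified sketch from memory, not the
verbatim 2.6.26 source):

#include <linux/fs.h>
#include <linux/smp_lock.h>

long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
	int error = -ENOTTY;

	if (!filp->f_op)
		goto out;

	if (filp->f_op->unlocked_ioctl) {
		/* modern path: no BKL */
		error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
		if (error == -ENOIOCTLCMD)
			error = -EINVAL;
	} else if (filp->f_op->ioctl) {
		/* legacy path: the ->ioctl call is wrapped in the BKL */
		lock_kernel();
		error = filp->f_op->ioctl(filp->f_path.dentry->d_inode,
					  filp, cmd, arg);
		unlock_kernel();
	}
 out:
	return error;
}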

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 12:09   ` Matthew Wilcox
@ 2008-05-06 16:23     ` Matthew Wilcox
  2008-05-06 16:36       ` Linus Torvalds
                         ` (2 more replies)
  0 siblings, 3 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-06 16:23 UTC (permalink / raw)
  To: Ingo Molnar, J. Bruce Fields
  Cc: Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds,
	Andrew Morton, linux-fsdevel

On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> So the only likely things I can see are:
> 
>  - file locks
>  - fasync

I've wanted to fix file locks for a while.  Here's a first attempt.
It was done quickly, so I concede that it may well have bugs in it.
I found (and fixed) one with LTP.

It takes *no account* of nfsd or of remote filesystems.  We need to have
a serious discussion about their requirements.

diff --git a/fs/locks.c b/fs/locks.c
index 663c069..cb09765 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -140,6 +140,8 @@ int lease_break_time = 45;
 #define for_each_lock(inode, lockp) \
 	for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
 
+static DEFINE_SPINLOCK(file_lock_lock);
+
 static LIST_HEAD(file_lock_list);
 static LIST_HEAD(blocked_list);
 
@@ -510,9 +512,9 @@ static void __locks_delete_block(struct file_lock *waiter)
  */
 static void locks_delete_block(struct file_lock *waiter)
 {
-	lock_kernel();
+	spin_lock(&file_lock_lock);
 	__locks_delete_block(waiter);
-	unlock_kernel();
+	spin_unlock(&file_lock_lock);
 }
 
 /* Insert waiter into blocker's block list.
@@ -649,7 +651,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
 {
 	struct file_lock *cfl;
 
-	lock_kernel();
+	spin_lock(&file_lock_lock);
 	for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) {
 		if (!IS_POSIX(cfl))
 			continue;
@@ -662,7 +664,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
 			fl->fl_pid = pid_vnr(cfl->fl_nspid);
 	} else
 		fl->fl_type = F_UNLCK;
-	unlock_kernel();
+	spin_unlock(&file_lock_lock);
 	return;
 }
 EXPORT_SYMBOL(posix_test_lock);
@@ -735,18 +737,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
 	int error = 0;
 	int found = 0;
 
-	lock_kernel();
-	if (request->fl_flags & FL_ACCESS)
+	if (request->fl_flags & FL_ACCESS) {
+		spin_lock(&file_lock_lock);
 		goto find_conflict;
+	}
 
 	if (request->fl_type != F_UNLCK) {
 		error = -ENOMEM;
+		
 		new_fl = locks_alloc_lock();
 		if (new_fl == NULL)
-			goto out;
+			goto out_unlocked;
 		error = 0;
 	}
 
+	spin_lock(&file_lock_lock);
 	for_each_lock(inode, before) {
 		struct file_lock *fl = *before;
 		if (IS_POSIX(fl))
@@ -772,10 +777,13 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
 	 * If a higher-priority process was blocked on the old file lock,
 	 * give it the opportunity to lock the file.
 	 */
-	if (found)
+	if (found) {
+		spin_unlock(&file_lock_lock);
 		cond_resched();
+		spin_lock(&file_lock_lock);
+	}
 
-find_conflict:
+ find_conflict:
 	for_each_lock(inode, before) {
 		struct file_lock *fl = *before;
 		if (IS_POSIX(fl))
@@ -796,8 +804,9 @@ find_conflict:
 	new_fl = NULL;
 	error = 0;
 
-out:
-	unlock_kernel();
+ out:
+	spin_unlock(&file_lock_lock);
+ out_unlocked:
 	if (new_fl)
 		locks_free_lock(new_fl);
 	return error;
@@ -826,7 +835,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
 		new_fl2 = locks_alloc_lock();
 	}
 
-	lock_kernel();
+	spin_lock(&file_lock_lock);
 	if (request->fl_type != F_UNLCK) {
 		for_each_lock(inode, before) {
 			fl = *before;
@@ -994,7 +1003,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
 		locks_wake_up_blocks(left);
 	}
  out:
-	unlock_kernel();
+	spin_unlock(&file_lock_lock);
 	/*
 	 * Free any unused locks.
 	 */
@@ -1069,14 +1078,14 @@ int locks_mandatory_locked(struct inode *inode)
 	/*
 	 * Search the lock list for this inode for any POSIX locks.
 	 */
-	lock_kernel();
+	spin_lock(&file_lock_lock);
 	for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
 		if (!IS_POSIX(fl))
 			continue;
 		if (fl->fl_owner != owner)
 			break;
 	}
-	unlock_kernel();
+	spin_unlock(&file_lock_lock);
 	return fl ? -EAGAIN : 0;
 }
 
@@ -1190,7 +1199,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
 
 	new_fl = lease_alloc(NULL, mode & FMODE_WRITE ? F_WRLCK : F_RDLCK);
 
-	lock_kernel();
+	spin_lock(&file_lock_lock);
 
 	time_out_leases(inode);
 
@@ -1251,8 +1260,10 @@ restart:
 			break_time++;
 	}
 	locks_insert_block(flock, new_fl);
+	spin_unlock(&file_lock_lock);
 	error = wait_event_interruptible_timeout(new_fl->fl_wait,
 						!new_fl->fl_next, break_time);
+	spin_lock(&file_lock_lock);
 	__locks_delete_block(new_fl);
 	if (error >= 0) {
 		if (error == 0)
@@ -1266,8 +1277,8 @@ restart:
 		error = 0;
 	}
 
-out:
-	unlock_kernel();
+ out:
+	spin_unlock(&file_lock_lock);
 	if (!IS_ERR(new_fl))
 		locks_free_lock(new_fl);
 	return error;
@@ -1323,7 +1334,7 @@ int fcntl_getlease(struct file *filp)
 	struct file_lock *fl;
 	int type = F_UNLCK;
 
-	lock_kernel();
+	spin_lock(&file_lock_lock);
 	time_out_leases(filp->f_path.dentry->d_inode);
 	for (fl = filp->f_path.dentry->d_inode->i_flock; fl && IS_LEASE(fl);
 			fl = fl->fl_next) {
@@ -1332,7 +1343,7 @@ int fcntl_getlease(struct file *filp)
 			break;
 		}
 	}
-	unlock_kernel();
+	spin_unlock(&file_lock_lock);
 	return type;
 }
 
@@ -1363,6 +1374,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
 	if (error)
 		return error;
 
+	spin_lock(&file_lock_lock);
 	time_out_leases(inode);
 
 	BUG_ON(!(*flp)->fl_lmops->fl_break);
@@ -1370,10 +1382,11 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
 	lease = *flp;
 
 	if (arg != F_UNLCK) {
+		spin_unlock(&file_lock_lock);
 		error = -ENOMEM;
 		new_fl = locks_alloc_lock();
 		if (new_fl == NULL)
-			goto out;
+			goto out_unlocked;
 
 		error = -EAGAIN;
 		if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
@@ -1382,6 +1395,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
 		    && ((atomic_read(&dentry->d_count) > 1)
 			|| (atomic_read(&inode->i_count) > 1)))
 			goto out;
+		spin_lock(&file_lock_lock);
 	}
 
 	/*
@@ -1429,11 +1443,14 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
 
 	locks_copy_lock(new_fl, lease);
 	locks_insert_lock(before, new_fl);
+	spin_unlock(&file_lock_lock);
 
 	*flp = new_fl;
 	return 0;
 
-out:
+ out:
+	spin_unlock(&file_lock_lock);
+ out_unlocked:
 	if (new_fl != NULL)
 		locks_free_lock(new_fl);
 	return error;
@@ -1471,12 +1488,10 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
 {
 	int error;
 
-	lock_kernel();
 	if (filp->f_op && filp->f_op->setlease)
 		error = filp->f_op->setlease(filp, arg, lease);
 	else
 		error = generic_setlease(filp, arg, lease);
-	unlock_kernel();
 
 	return error;
 }
@@ -1503,12 +1518,11 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
 	if (error)
 		return error;
 
-	lock_kernel();
-
 	error = vfs_setlease(filp, arg, &flp);
 	if (error || arg == F_UNLCK)
-		goto out_unlock;
+		return error;
 
+	lock_kernel();
 	error = fasync_helper(fd, filp, 1, &flp->fl_fasync);
 	if (error < 0) {
 		/* remove lease just inserted by setlease */
@@ -1519,7 +1533,7 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
 	}
 
 	error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
-out_unlock:
+ out_unlock:
 	unlock_kernel();
 	return error;
 }
@@ -2024,7 +2038,7 @@ void locks_remove_flock(struct file *filp)
 			fl.fl_ops->fl_release_private(&fl);
 	}
 
-	lock_kernel();
+	spin_lock(&file_lock_lock);
 	before = &inode->i_flock;
 
 	while ((fl = *before) != NULL) {
@@ -2042,7 +2056,7 @@ void locks_remove_flock(struct file *filp)
  		}
 		before = &fl->fl_next;
 	}
-	unlock_kernel();
+	spin_unlock(&file_lock_lock);
 }
 
 /**
@@ -2057,12 +2071,12 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter)
 {
 	int status = 0;
 
-	lock_kernel();
+	spin_lock(&file_lock_lock);
 	if (waiter->fl_next)
 		__locks_delete_block(waiter);
 	else
 		status = -ENOENT;
-	unlock_kernel();
+	spin_unlock(&file_lock_lock);
 	return status;
 }
 
@@ -2175,7 +2189,7 @@ static int locks_show(struct seq_file *f, void *v)
 
 static void *locks_start(struct seq_file *f, loff_t *pos)
 {
-	lock_kernel();
+	spin_lock(&file_lock_lock);
 	f->private = (void *)1;
 	return seq_list_start(&file_lock_list, *pos);
 }
@@ -2187,7 +2201,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
 
 static void locks_stop(struct seq_file *f, void *v)
 {
-	unlock_kernel();
+	spin_unlock(&file_lock_lock);
 }
 
 struct seq_operations locks_seq_operations = {
@@ -2215,7 +2229,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
 {
 	struct file_lock *fl;
 	int result = 1;
-	lock_kernel();
+	spin_lock(&file_lock_lock);
 	for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
 		if (IS_POSIX(fl)) {
 			if (fl->fl_type == F_RDLCK)
@@ -2232,7 +2246,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
 		result = 0;
 		break;
 	}
-	unlock_kernel();
+	spin_unlock(&file_lock_lock);
 	return result;
 }
 
@@ -2255,7 +2269,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
 {
 	struct file_lock *fl;
 	int result = 1;
-	lock_kernel();
+	spin_lock(&file_lock_lock);
 	for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
 		if (IS_POSIX(fl)) {
 			if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
@@ -2270,7 +2284,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
 		result = 0;
 		break;
 	}
-	unlock_kernel();
+	spin_unlock(&file_lock_lock);
 	return result;
 }
 

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 16:23     ` Matthew Wilcox
@ 2008-05-06 16:36       ` Linus Torvalds
  2008-05-06 16:42         ` Matthew Wilcox
  2008-05-06 16:44         ` J. Bruce Fields
  2008-05-06 17:21       ` Andrew Morton
  2008-05-08  3:24       ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
  2 siblings, 2 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-06 16:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, Andrew Morton, linux-fsdevel



On Tue, 6 May 2008, Matthew Wilcox wrote:
> 
> I've wanted to fix file locks for a while.  Here's a first attempt.
> It was done quickly, so I concede that it may well have bugs in it.
> I found (and fixed) one with LTP.

Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is 
there some user that doesn't have the inode info, or does anything that 
might cross inode boundaries?

This does seem to drop all locking around the "setlease()" calls down to 
the filesystem, which worries me. That said, we clearly do need to do 
this. Probably should have done it a long time ago.

Also, why do people do this:

> -find_conflict:
> + find_conflict:

Hmm?

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 16:42         ` Matthew Wilcox
@ 2008-05-06 16:39           ` Alan Cox
  2008-05-06 16:51             ` Matthew Wilcox
  2008-05-06 20:28           ` Linus Torvalds
  1 sibling, 1 reply; 140+ messages in thread
From: Alan Cox @ 2008-05-06 16:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, Andrew Morton, linux-fsdevel

> > Hmm?
> 
> So that find_conflict doesn't end up in the first column, which causes
> diff to treat it as a function name for the purposes of the @@ lines.

Please can we just fix the tools rather than mangle the kernel to work
around silly bugs?


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 16:36       ` Linus Torvalds
@ 2008-05-06 16:42         ` Matthew Wilcox
  2008-05-06 16:39           ` Alan Cox
  2008-05-06 20:28           ` Linus Torvalds
  2008-05-06 16:44         ` J. Bruce Fields
  1 sibling, 2 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-06 16:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, Andrew Morton, linux-fsdevel

On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote:
> On Tue, 6 May 2008, Matthew Wilcox wrote:
> > I've wanted to fix file locks for a while.  Here's a first attempt.
> > It was done quickly, so I concede that it may well have bugs in it.
> > I found (and fixed) one with LTP.
> 
> Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is 
> there some user that doesn't have the inode info, or does anything that 
> might cross inode boundaries?

/proc/locks and deadlock detection both cross inode boundaries (and even
filesystem boundaries).  The BKL-removal brigade tried this back in 2.4
and the locking ended up scaling worse than just plonking a single
spinlock around the whole thing.

> This does seem to drop all locking around the "setlease()" calls down to 
> the filesystem, which worries me. That said, we clearly do need to do 
> this. Probably should have done it a long time ago.

The only filesystems that are going to have their own setlease methods
will be remote ones (nfs, smbfs, etc).  They're going to need to sleep
while the server responds to them.  So holding a spinlock while we call
them is impolite at best.

> Also, why do people do this:
> 
> > -find_conflict:
> > + find_conflict:
> 
> Hmm?

So that find_conflict doesn't end up in the first column, which causes
diff to treat it as a function name for the purposes of the @@ lines.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 16:36       ` Linus Torvalds
  2008-05-06 16:42         ` Matthew Wilcox
@ 2008-05-06 16:44         ` J. Bruce Fields
  1 sibling, 0 replies; 140+ messages in thread
From: J. Bruce Fields @ 2008-05-06 16:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Ingo Molnar, Zhang, Yanmin, LKML, Alexander Viro,
	Andrew Morton, linux-fsdevel, richterd

On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 6 May 2008, Matthew Wilcox wrote:
> > 
> > I've wanted to fix file locks for a while.  Here's a first attempt.
> > It was done quickly, so I concede that it may well have bugs in it.
> > I found (and fixed) one with LTP.
> 
> Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is 
> there some user that doesn't have the inode info, or does anything that 
> might cross inode boundaries?

The deadlock detection crosses inode boundaries.

--b.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 16:51             ` Matthew Wilcox
@ 2008-05-06 16:45               ` Alan Cox
  2008-05-06 17:42               ` Linus Torvalds
  1 sibling, 0 replies; 140+ messages in thread
From: Alan Cox @ 2008-05-06 16:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, Andrew Morton, linux-fsdevel

On Tue, 6 May 2008 10:51:12 -0600
Matthew Wilcox <matthew@wil.cx> wrote:

> On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote:
> > > > Hmm?
> > > 
> > > So that find_conflict doesn't end up in the first column, which causes
> > > diff to treat it as a function name for the purposes of the @@ lines.
> > 
> > Please can we just fix the tools not mangle the kernel to work around
> > silly bugs ?
> 
> The people who control the tools refuse to fix them.

That would be their problem. We should refuse to mash the kernel up
because they aren't doing their job. If need be someone can fork a
private git-patch.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 16:39           ` Alan Cox
@ 2008-05-06 16:51             ` Matthew Wilcox
  2008-05-06 16:45               ` Alan Cox
  2008-05-06 17:42               ` Linus Torvalds
  0 siblings, 2 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-06 16:51 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, Andrew Morton, linux-fsdevel

On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote:
> > > Hmm?
> > 
> > So that find_conflict doesn't end up in the first column, which causes
> > diff to treat it as a function name for the purposes of the @@ lines.
> 
> Please can we just fix the tools not mangle the kernel to work around
> silly bugs ?

The people who control the tools refuse to fix them.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 16:23     ` Matthew Wilcox
  2008-05-06 16:36       ` Linus Torvalds
@ 2008-05-06 17:21       ` Andrew Morton
  2008-05-06 17:31         ` Matthew Wilcox
                           ` (3 more replies)
  2008-05-08  3:24       ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
  2 siblings, 4 replies; 140+ messages in thread
From: Andrew Morton @ 2008-05-06 17:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, Linus Torvalds, linux-fsdevel

On Tue, 6 May 2008 10:23:32 -0600 Matthew Wilcox <matthew@wil.cx> wrote:

> On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > So the only likely things I can see are:
> > 
> >  - file locks
> >  - fasync
> 
> I've wanted to fix file locks for a while.  Here's a first attempt.

Do we actually know that the locks code is implicated in this regression?

I'd initially thought "lseek" but afaict tmpfs doesn't hit default_llseek()
or remote_llseek().

tmpfs tends to do weird stuff - it would be interesting to know if the
regression is also present on ramfs or ext2/ext3/xfs/etc.

It would be interesting to see if the context switch rate has increased.

Finally: how come we regressed by swapping the semaphore implementation
anyway?  We went from one sleeping lock implementation to another - I'd
have expected performance to be pretty much the same.

<looks at the implementation>

down(), down_interruptible() and down_trylock() should use spin_lock_irq(), not
irqsave.

up() seems to be doing wake-one, FIFO which is nice.  Did the
implementation which we just removed also do that?  Was it perhaps
accidentally doing LIFO or something like that?
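
(The wake-one, FIFO behaviour referred to here is roughly the following; a
sketch consistent with the __up() context lines visible in Ingo's patch further
down the thread, not guaranteed to match -rc1 verbatim:)

#include <linux/sched.h>
#include <linux/semaphore.h>

/* One record per sleeping task, queued on sem->wait_list. */
struct semaphore_waiter {
	struct list_head	list;
	struct task_struct	*task;
	int			up;	/* set when the semaphore is handed to this waiter */
};

/* Called from up() with sem->lock held and waiters present. */
static void __up(struct semaphore *sem)
{
	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
					struct semaphore_waiter, list);

	list_del(&waiter->list);	/* FIFO: oldest waiter first */
	waiter->up = 1;			/* hand over the semaphore   */
	wake_up_process(waiter->task);	/* wake exactly one task     */
}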


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 17:21       ` Andrew Morton
@ 2008-05-06 17:31         ` Matthew Wilcox
  2008-05-06 17:49           ` Ingo Molnar
  2008-05-06 17:39         ` Ingo Molnar
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-06 17:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, Linus Torvalds, linux-fsdevel

On Tue, May 06, 2008 at 10:21:53AM -0700, Andrew Morton wrote:
> Do we actually know that the locks code is implicated in this regression?

Not yet.  We don't even know it's the BKL.  It's just my best guess.
We're waiting for the original reporter to run some tests Ingo pointed
him at.

> I'd initially thought "lseek" but afaict tmpfs doesn't hit default_llseek()
> or remote_llseek().

Correct.

> Finally: how come we regressed by swapping the semaphore implementation
> anyway?  We went from one sleeping lock implementation to another - I'd
> have expected performance to be pretty much the same.
> 
> <looks at the implementation>
> 
> down(), down_interruptible() and down_trylock() should use spin_lock_irq(), not
> irqsave.

We talked about this ... the BKL actually requires that you be able to
acquire it with interrupts disabled.  Maybe we should make lock_kernel
do this:

	if (likely(!depth)) {
		unsigned long flags;
		local_save_flags(flags);
		down();
		local_irq_restore(flags);
	}
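
Spelled out in context, that would look roughly like this (a sketch only; it
assumes the BKL semaphore in lib/kernel_lock.c is still named kernel_sem, and
it is untested):

#include <linux/sched.h>
#include <linux/semaphore.h>
#include <linux/irqflags.h>

extern struct semaphore kernel_sem;	/* BKL semaphore in lib/kernel_lock.c */

void lock_kernel(void)
{
	struct task_struct *task = current;
	int depth = task->lock_depth + 1;

	if (likely(!depth)) {
		unsigned long flags;

		/*
		 * down() may re-enable interrupts internally; preserve the
		 * caller's interrupt state so lock_kernel() remains safe to
		 * call with interrupts disabled.
		 */
		local_save_flags(flags);
		down(&kernel_sem);
		local_irq_restore(flags);
	}
	task->lock_depth = depth;
}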

But tweaking down() is not worth it -- we should be eliminating users of
both the BKL and semaphores instead.

> up() seems to be doing wake-one, FIFO which is nice.  Did the
> implementation which we just removed also do that?  Was it perhaps
> accidentally doing LIFO or something like that?

That's a question for someone who knows x86 assembler, I think.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 17:21       ` Andrew Morton
  2008-05-06 17:31         ` Matthew Wilcox
@ 2008-05-06 17:39         ` Ingo Molnar
  2008-05-07  6:49           ` Zhang, Yanmin
  2008-05-06 17:45         ` Linus Torvalds
  2008-05-07 16:38         ` Matthew Wilcox
  3 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-06 17:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, Linus Torvalds, linux-fsdevel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> Finally: how come we regressed by swapping the semaphore 
> implementation anyway?  We went from one sleeping lock implementation 
> to another - I'd have expected performance to be pretty much the same.
> 
> <looks at the implementation>
> 
> > down(), down_interruptible() and down_trylock() should use 
> spin_lock_irq(), not irqsave.
> 
> up() seems to be doing wake-one, FIFO which is nice.  Did the 
> implementation which we just removed also do that?  Was it perhaps 
> accidentally doing LIFO or something like that?

i just checked the old implementation on x86. It used 
lib/semaphore-sleepers.c which does one weird thing:

  - __down(), when it returns, wakes up yet another task via 
    wake_up_locked().

i.e. we'll always keep yet another task in flight. This can mask wakeup 
latencies especially when it takes time.

The patch (hack) below tries to emulate this weirdness - it 'kicks' 
another task as well and keeps it busy. Most of the time this just 
causes extra scheduling, but if AIM7 is _just_ saturating the number of 
CPUs, it might make a difference. Yanmin, does the patch below make any 
difference to the AIM7 results?

( it would be useful data to get a meaningful context switch trace from 
  the whole regressed workload, and compare it to a context switch trace 
  with the revert added. )

	Ingo

---
 kernel/semaphore.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -261,4 +261,14 @@ static noinline void __sched __up(struct
 	list_del(&waiter->list);
 	waiter->up = 1;
 	wake_up_process(waiter->task);
+
+	if (likely(list_empty(&sem->wait_list)))
+		return;
+	/*
+	 * Opportunistically wake up another task as well but do not
+	 * remove it from the list:
+	 */
+	waiter = list_first_entry(&sem->wait_list,
+				  struct semaphore_waiter, list);
+	wake_up_process(waiter->task);
 }

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 16:51             ` Matthew Wilcox
  2008-05-06 16:45               ` Alan Cox
@ 2008-05-06 17:42               ` Linus Torvalds
  1 sibling, 0 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-06 17:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alan Cox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, Andrew Morton, linux-fsdevel



On Tue, 6 May 2008, Matthew Wilcox wrote:

> On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote:
> > > > Hmm?
> > > 
> > > So that find_conflict doesn't end up in the first column, which causes
> > > diff to treat it as a function name for the purposes of the @@ lines.
> > 
> > Please can we just fix the tools not mangle the kernel to work around
> > silly bugs ?
> 
> The people who control the tools refuse to fix them.

That's just plain bollocks.

First off, that "@@" line thing in diffs is not important enough to screw 
up the source code for. Ever.  It's just a small hint to make it somewhat 
easier to see more context for humans.

Second, it's not even that bad to show the last label there, rather than 
the function name.

Third, you seem to be a git user, so if you actually really care that much 
about the @@ line, then git actually lets you set your very own pattern 
for those things.

In fact, you can even do it on a per-file basis based on things like 
filename rules (ie you can have different patterns for what to trigger on 
for a *.c file and for a *.S file, since in a *.S file the 'name:' thing 
_is_ the right pattern).

So not only are you making idiotic changes just for irrelevant tool usage, 
you're also apparently lying about people "refusing to fix" things as an 
excuse.

You can play with it. It's documented in gitattributes (see "Defining a 
custom hunk header"), and the default one is just the same one that GNU 
diff uses for "-p". I think.

You can add something like this to your ~/.gitconfig:

	[diff "default"]
		funcname=^[a-zA-Z_$].*(.*$

to only trigger the funcname pattern on a line that starts with a valid C 
identifier thing, and contains a '('.

And you can just override the default like the above (that way you don't 
have to specify attributes), but if you want to do things differently for 
*.c files than from *.S files, you can edit your .git/info/attributes file 
and make it contain something like

	*.S	diff=assembler
	*.c	diff=C

and now you can make your ~/.gitconfig actually show them differently, ie 
something like

	[diff "C"]
		funcname=^[a-zA-Z_$].*(.*$

	[diff "assembler"]
		funcname=^[a-zA-Z_$].*:

etc.

Of course, there is a real cost to this, but it's cheap enough in practice 
that you'll never notice.

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 17:21       ` Andrew Morton
  2008-05-06 17:31         ` Matthew Wilcox
  2008-05-06 17:39         ` Ingo Molnar
@ 2008-05-06 17:45         ` Linus Torvalds
  2008-05-07 16:38         ` Matthew Wilcox
  3 siblings, 0 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-06 17:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel



On Tue, 6 May 2008, Andrew Morton wrote:
> 
> down(), down_interruptible() and down_trylock() should use spin_lock_irq(), not
> irqsave.

down_trylock() is used in atomic code. See for example kernel/printk.c. So 
no, that one needs to be irqsafe.
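
(For reference, the generic down_trylock() looks roughly like this; a
simplified sketch, not the verbatim 2.6.26 source. Because it may be called
from any context, including with interrupts already disabled, it saves and
restores the interrupt state instead of unconditionally re-enabling it:)

#include <linux/semaphore.h>
#include <linux/spinlock.h>

/*
 * Try to take the semaphore without sleeping; returns 0 on success,
 * non-zero if it could not be taken.  Safe in atomic context, e.g. the
 * printk()/console_sem path mentioned above.
 */
int down_trylock(struct semaphore *sem)
{
	unsigned long flags;
	int count;

	spin_lock_irqsave(&sem->lock, flags);
	count = sem->count - 1;
	if (likely(count >= 0))
		sem->count = count;	/* got it */
	spin_unlock_irqrestore(&sem->lock, flags);

	return (count < 0);
}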

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 17:31         ` Matthew Wilcox
@ 2008-05-06 17:49           ` Ingo Molnar
  2008-05-06 18:07             ` Andrew Morton
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-06 17:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, Linus Torvalds, linux-fsdevel


* Matthew Wilcox <matthew@wil.cx> wrote:

> > down(), down_interruptible() and down_trylock() should use 
> > spin_lock_irq(), not irqsave.
> 
> We talked about this ... the BKL actually requires that you be able to 
> acquire it with interrupts disabled. [...]

hm, where does it require it, besides the early bootup code? (which 
should just be fixed)

down_trylock() is OK as irqsave/irqrestore for legacy reasons, but that 
is fundamentally atomic anyway.

> > up() seems to be doing wake-one, FIFO which is nice.  Did the 
> > implementation which we just removed also do that?  Was it perhaps 
> > accidentally doing LIFO or something like that?
> 
> That's a question for someone who knows x86 assembler, I think.

the assembly is mostly just for the fastpath - and a 40% regression 
cannot be about fastpath differences. In the old code the scheduling 
happens in lib/semaphore-sleepers.c, and from the looks of it it appears 
to be a proper FIFO as well. (plus this small wakeup weirdness it has)

i reviewed the new code in kernel/semaphore.c as well and can see 
nothing bad in it - it does proper wake-up, FIFO queueing, like the 
mutex code.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 17:49           ` Ingo Molnar
@ 2008-05-06 18:07             ` Andrew Morton
  2008-05-11 11:11               ` Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Andrew Morton @ 2008-05-06 18:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: matthew, bfields, yanmin_zhang, linux-kernel, viro, torvalds,
	linux-fsdevel

On Tue, 6 May 2008 19:49:54 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Matthew Wilcox <matthew@wil.cx> wrote:
> 
> > > down(), down_interruptible() and down_trylock() should use 
> > > spin_lock_irq(), not irqsave.
> > 
> > We talked about this ... the BKL actually requires that you be able to 
> > acquire it with interrupts disabled. [...]
> 
> hm, where does it require it, besides the early bootup code? (which 
> should just be fixed)

Yeah, the early bootup code.  The kernel does accidental lock_kernel()s in
various places and if that re-enables interrupts then powerpc goeth crunch.

Matthew, that seemingly-unneeded irqsave in kernel/semaphore.c is a prime site
for /* one of these things */, no?

> down_trylock() is OK as irqsave/irqrestore for legacy reasons, but that 
> is fundamentally atomic anyway.

yes, trylock should be made irq-safe.

> > > up() seems to be doing wake-one, FIFO which is nice.  Did the 
> > > implementation which we just removed also do that?  Was it perhaps 
> > > accidentally doing LIFO or something like that?
> > 
> > That's a question for someone who knows x86 assembler, I think.
> 
> the assembly is mostly just for the fastpath - and a 40% regression 
> cannot be about fastpath differences. In the old code the scheduling 
> happens in lib/semaphore-sleeper.c, and from the looks of it it appears 
> to be a proper FIFO as well. (plus this small wakeup weirdness it has)
> 
> i reviewed the new code in kernel/semaphore.c as well and can see 
> nothing bad in it - it does proper wake-up, FIFO queueing, like the 
> mutex code.
> 

There's the weird wakeup in down() which I understood for about five
minutes five years ago.  Perhaps that accidentally sped something up.
Oh well, more investigation needed..

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 16:42         ` Matthew Wilcox
  2008-05-06 16:39           ` Alan Cox
@ 2008-05-06 20:28           ` Linus Torvalds
  1 sibling, 0 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-06 20:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, Andrew Morton, linux-fsdevel



On Tue, 6 May 2008, Matthew Wilcox wrote:
> On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote:
> > 
> > Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is 
> > there some user that doesn't have the inode info, or does anything that 
> > might cross inode boundaries?
> 
> /proc/locks and deadlock detection both cross inode boundaries (and even
> filesystem boundaries).  The BKL-removal brigade tried this back in 2.4
> and the locking ended up scaling worse than just plonking a single
> spinlock around the whole thing.

Ok, no worries. Just as long as I know why it's a single lock. Looks ok to 
me, apart from the need for testing (and talking to NFS etc people).

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 11:44 ` Ingo Molnar
  2008-05-06 12:09   ` Matthew Wilcox
@ 2008-05-07  2:11   ` Zhang, Yanmin
  2008-05-07  3:41     ` Zhang, Yanmin
  1 sibling, 1 reply; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-07  2:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton


On Tue, 2008-05-06 at 13:44 +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> 
> > Comparing with kernel 2.6.25, AIM7 (using tmpfs) has a more than 40% 
> > regression with 2.6.26-rc1 on my 8-core Stoakley, 16-core Tigerton, and 
> > Itanium Montecito. Bisect located the patch below.
> > 
> > 64ac24e738823161693bf791f87adc802cf529ff is first bad commit
> > commit 64ac24e738823161693bf791f87adc802cf529ff
> > Author: Matthew Wilcox <matthew@wil.cx>
> > Date:   Fri Mar 7 21:55:58 2008 -0500
> > 
> >     Generic semaphore implementation
> > 
> > After I manually reverted the patch against 2.6.26-rc1 while fixing 
> > lots of conflicts/errors, the aim7 regression became less than 2%.
> 
> hm, which exact semaphore would that be due to?
> 
> My first blind guess would be the BKL - there's not much other semaphore 
> use left in the core kernel otherwise that would affect AIM7 normally. 
> The VFS still makes frequent use of the BKL and AIM7 is very VFS 
> intense. Getting rid of that BKL use from the VFS might be useful to 
> performance anyway.
> 
> Could you try to check that it's indeed the BKL?
> 
> Easiest way to check it would be to run AIM7 it on 
> sched-devel.git/latest and do scheduler tracing via:
> 
>    http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt
Thank you guys for the quick response. I ran into many regressions with 2.6.26-rc1, but
just reported 2 of them because I located the patches. My machine is now locating the root cause
of a 30% regression of sysbench+mysql (oltp readonly). Bisect is not so quick because either
the kernel hangs during testing or compilation fails.

Another regression, specjbb2005 on Montvale, is also under investigation.

Let me figure out how to clone your tree quickly as the network speed is very slow.

One clearly weird behavior of aim7 is that cpu idle is 0% with 2.6.25, but more than 50% with
2.6.26-rc1. I have a patch to collect schedule info.

> 
> by doing:
> 
>    echo stacktrace > /debug/tracing/iter_ctl
> 
> you could get exact backtraces of all scheduling points in the trace. If 
> the BKL's down() shows up in those traces then it's definitely the BKL 
> that causes this. The backtraces will also tell us exactly which BKL use 
> is the most frequent one.
> 
> To keep tracing overhead low on SMP i'd also suggest to only trace a 
> single CPU, via:
> 
>   echo 1 > /debug/tracing/tracing_cpumask
> 
> 	Ingo


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07  2:11   ` Zhang, Yanmin
@ 2008-05-07  3:41     ` Zhang, Yanmin
  2008-05-07  3:59       ` Andrew Morton
                         ` (2 more replies)
  0 siblings, 3 replies; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-07  3:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton


On Wed, 2008-05-07 at 10:11 +0800, Zhang, Yanmin wrote:
> On Tue, 2008-05-06 at 13:44 +0200, Ingo Molnar wrote:
> > * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> > 
> > > Comparing with kernel 2.6.25, AIM7 (using tmpfs) has a more than 40% 
> > > regression with 2.6.26-rc1 on my 8-core Stoakley, 16-core Tigerton, and 
> > > Itanium Montecito. Bisect located the patch below.
> > > 
> > > 64ac24e738823161693bf791f87adc802cf529ff is first bad commit
> > > commit 64ac24e738823161693bf791f87adc802cf529ff
> > > Author: Matthew Wilcox <matthew@wil.cx>
> > > Date:   Fri Mar 7 21:55:58 2008 -0500
> > > 
> > >     Generic semaphore implementation
> > > 
> > > After I manually reverted the patch against 2.6.26-rc1 while fixing 
> > > lots of conflicts/errors, the aim7 regression became less than 2%.
> > 
> > hm, which exact semaphore would that be due to?
> > 
> > My first blind guess would be the BKL - there's not much other semaphore 
> > use left in the core kernel otherwise that would affect AIM7 normally. 
> > The VFS still makes frequent use of the BKL and AIM7 is very VFS 
> > intense. Getting rid of that BKL use from the VFS might be useful to 
> > performance anyway.
> > 
> > Could you try to check that it's indeed the BKL?
> > 
> > Easiest way to check it would be to run AIM7 it on 
> > sched-devel.git/latest and do scheduler tracing via:
> > 
> >    http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt
> One clearly weird behavior of aim7 is that cpu idle is 0% with 2.6.25, but more than 50% with
> 2.6.26-rc1. I have a patch to collect schedule info.
With my patch+gprof, I collected some data. Below is the output from gprof.

index % time    self  children    called     name
                0.00    0.00       2/223305376     __down_write_nested [22749]
                0.00    0.00       3/223305376     journal_commit_transaction [10526]
                0.00    0.00       6/223305376     __down_read [22745]
                0.00    0.00       8/223305376     start_this_handle [19167]
                0.00    0.00      15/223305376     sys_pause [19808]
                0.00    0.00      17/223305376     log_wait_commit [11047]
                0.00    0.00      20/223305376     futex_wait [8122]
                0.00    0.00      64/223305376     pdflush [14335]
                0.00    0.00      71/223305376     do_get_write_access [5367]
                0.00    0.00      84/223305376     pipe_wait [14460]
                0.00    0.00     111/223305376     kjournald [10726]
                0.00    0.00     116/223305376     int_careful [9634]
                0.00    0.00     224/223305376     do_nanosleep [5418]
                0.00    0.00    1152/223305376     watchdog [22065]
                0.00    0.00    4087/223305376     worker_thread [22076]
                0.00    0.00    5003/223305376     __mutex_lock_killable_slowpath [23305]
                0.00    0.00    7810/223305376     ksoftirqd [10831]
                0.00    0.00    9389/223305376     __mutex_lock_slowpath [23306]
                0.00    0.00   10642/223305376     io_schedule [9813]
                0.00    0.00   23544/223305376     migration_thread [11495]
                0.00    0.00   35319/223305376     __cond_resched [22673]
                0.00    0.00   49065/223305376     retint_careful [16146]
                0.00    0.00  119757/223305376     sysret_careful [20074]
                0.00    0.00  151717/223305376     do_wait [5545]
                0.00    0.00  250221/223305376     do_exit [5356]
                0.00    0.00  303836/223305376     cpu_idle [4350]
                0.00    0.00 222333093/223305376     schedule_timeout [2]
[1]      0.0    0.00    0.00 223305376         schedule [1]
-----------------------------------------------
                0.00    0.00       2/222333093     io_schedule_timeout [9814]
                0.00    0.00       4/222333093     journal_stop [10588]
                0.00    0.00       8/222333093     cifs_oplock_thread [3760]
                0.00    0.00      14/222333093     do_sys_poll [5513]
                0.00    0.00      20/222333093     cifs_dnotify_thread [3733]
                0.00    0.00      32/222333093     read_chan [15648]
                0.00    0.00      47/222333093     wait_for_common [22017]
                0.00    0.00     658/222333093     do_select [5479]
                0.00    0.00    2000/222333093     inet_stream_connect [9324]
                0.00    0.00 222330308/222333093     __down [22577]
[2]      0.0    0.00    0.00 222333093         schedule_timeout [2]
                0.00    0.00 222333093/223305376     schedule [1]
-----------------------------------------------
                0.00    0.00       1/165565      flock_lock_file_wait [7735]
                0.00    0.00       7/165565      __posix_lock_file [23371]
                0.00    0.00     203/165565      de_put [4665]
                0.00    0.00     243/165565      opost [13633]
                0.00    0.00     333/165565      proc_root_readdir [14982]
                0.00    0.00     358/165565      write_chan [22090]
                0.00    0.00    6222/165565      proc_lookup_de [14908]
                0.00    0.00   32081/165565      sys_fcntl [19687]
                0.00    0.00   36045/165565      vfs_ioctl [21822]
                0.00    0.00   42025/165565      tty_release [20818]
                0.00    0.00   48047/165565      chrdev_open [3702]
[3]      0.0    0.00    0.00  165565         lock_kernel [3]
                0.00    0.00  152987/153190      down [4]
-----------------------------------------------
                0.00    0.00     203/153190      __reacquire_kernel_lock [23420]
                0.00    0.00  152987/153190      lock_kernel [3]
[4]      0.0    0.00    0.00  153190         down [4]
                0.00    0.00  153190/153190      __down [22577]
-----------------------------------------------
                0.00    0.00  153190/153190      down [4]
[22577]  0.0    0.00    0.00  153190         __down [22577]
                0.00    0.00 222330308/222333093     schedule_timeout [2]


As system idle is more than 50%, the schedule/schedule_timeout callers are important
information.
1) lock_kernel causes most schedule/schedule_timeout calls;
2) lock_kernel calls down, which calls __down, and __down calls schedule_timeout
many times in a loop;
3) The callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
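
A simplified sketch of the slow path these counts refer to (reconstructed
around the struct semaphore_waiter fields visible in Ingo's earlier patch, not
the exact -rc1 source) shows where __down() ends up calling schedule_timeout():

#include <linux/sched.h>
#include <linux/semaphore.h>
#include <linux/spinlock.h>

/* One record per sleeping task (as sketched earlier in the thread). */
struct semaphore_waiter {
	struct list_head	list;
	struct task_struct	*task;
	int			up;
};

/* Called from down() with sem->lock held and the count exhausted. */
static void __down(struct semaphore *sem)
{
	struct semaphore_waiter waiter = { .task = current, .up = 0 };

	list_add_tail(&waiter.list, &sem->wait_list);

	for (;;) {
		__set_task_state(current, TASK_UNINTERRUPTIBLE);
		spin_unlock_irq(&sem->lock);
		/* every pass through here is one schedule_timeout() call */
		schedule_timeout(MAX_SCHEDULE_TIMEOUT);
		spin_lock_irq(&sem->lock);
		if (waiter.up)		/* __up() handed us the semaphore */
			break;
	}
}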

-yanmin



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07  3:41     ` Zhang, Yanmin
@ 2008-05-07  3:59       ` Andrew Morton
  2008-05-07  4:46         ` Zhang, Yanmin
  2008-05-07  6:26       ` Ingo Molnar
  2008-05-07 11:00       ` Andi Kleen
  2 siblings, 1 reply; 140+ messages in thread
From: Andrew Morton @ 2008-05-07  3:59 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds

On Wed, 07 May 2008 11:41:52 +0800 "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:

> As system idle is more than 50%, the schedule/schedule_timeout callers are important
> information.
> 1) lock_kernel causes most schedule/schedule_timeout calls;
> 2) lock_kernel calls down, which calls __down, and __down calls schedule_timeout
> many times in a loop;

Really?  Are you sure?  That would imply that we keep on waking up tasks
which then fail to acquire the lock.  But the code pretty plainly doesn't
do that.

Odd.

> 3) The callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.

Still :(

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07  3:59       ` Andrew Morton
@ 2008-05-07  4:46         ` Zhang, Yanmin
  0 siblings, 0 replies; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-07  4:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds


On Tue, 2008-05-06 at 20:59 -0700, Andrew Morton wrote:
> On Wed, 07 May 2008 11:41:52 +0800 "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:
> 
> > As system idle is more than 50%, the schedule/schedule_timeout callers are important
> > information.
> > 1) lock_kernel causes most schedule/schedule_timeout calls;
> > 2) lock_kernel calls down, which calls __down, and __down calls schedule_timeout
> > many times in a loop;
> 
> Really?  Are you sure?  That would imply that we keep on waking up tasks
> which then fail to acquire the lock.  But the code pretty plainly doesn't
> do that.
Yes, totally based on the data.
The data shows the call counts between functions. Initially, I just collected the callers
of schedule and schedule_timeout. Then I found most schedule/schedule_timeout calls come from
__down, which is called by down. Then I changed the kernel to collect more functions' call counts.

Comparing the call counts of down, __down and schedule_timeout, we can see that
schedule_timeout is called by __down 222330308 times, but __down itself is called only 153190
times (roughly 1450 schedule_timeout calls per __down).

> 
> Odd.
> 
> > 3) The callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
> Still :(
Yes. The data has some margin of error, but the error is small. My patch doesn't
use a lock to protect the data, because that might introduce too much overhead.



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07  3:41     ` Zhang, Yanmin
  2008-05-07  3:59       ` Andrew Morton
@ 2008-05-07  6:26       ` Ingo Molnar
  2008-05-07  6:28         ` Ingo Molnar
  2008-05-07 11:00       ` Andi Kleen
  2 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07  6:26 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton


* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> 3) The callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.

that's one often-forgotten BKL site: about 1000 ioctls are still running 
under the BKL. The TTY one is hurting the most. To make sure it's only 
that BKL acquire/release that hurts, could you try the hack patch below
and see whether it makes any difference to performance?

but even if taking the BKL does hurt, it's quite unexpected to cause a 
40% drop. Perhaps AIM7 has tons of threads that exit at once and all try 
to release their controlling terminal or something like that?

	Ingo

------------------------>
Subject: DANGEROUS tty hack: no BKL
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 08:21:22 CEST 2008

NOT-Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 drivers/char/tty_io.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux/drivers/char/tty_io.c
===================================================================
--- linux.orig/drivers/char/tty_io.c
+++ linux/drivers/char/tty_io.c
@@ -2844,9 +2844,10 @@ out:
 
 static int tty_release(struct inode *inode, struct file *filp)
 {
-	lock_kernel();
+	/* DANGEROUS - can crash your kernel! */
+//	lock_kernel();
 	release_dev(filp);
-	unlock_kernel();
+//	unlock_kernel();
 	return 0;
 }
 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07  6:26       ` Ingo Molnar
@ 2008-05-07  6:28         ` Ingo Molnar
  2008-05-07  7:05           ` Zhang, Yanmin
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07  6:28 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton


* Ingo Molnar <mingo@elte.hu> wrote:

> > 3) The callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
> 
> that's one often-forgotten BKL site: about 1000 ioctls are still 
> running under the BKL. The TTY one is hurting the most. [...]

although it's an unlocked_ioctl() now in 2.6.26, so all the BKL locking 
has been nicely pushed down to deep inside the tty code.

> [...] To make sure it's only that BKL acquire/release that hurts, 
> could you try the hack patch below, does it make any difference to 
> performance?

if you use a serial console you will need the updated patch below.

	Ingo

---------------------->
Subject: no: tty bkl
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 08:21:22 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 drivers/char/tty_io.c        |    5 +++--
 drivers/serial/serial_core.c |    2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

Index: linux/drivers/char/tty_io.c
===================================================================
--- linux.orig/drivers/char/tty_io.c
+++ linux/drivers/char/tty_io.c
@@ -2844,9 +2844,10 @@ out:
 
 static int tty_release(struct inode *inode, struct file *filp)
 {
-	lock_kernel();
+	/* DANGEROUS - can crash your kernel! */
+//	lock_kernel();
 	release_dev(filp);
-	unlock_kernel();
+//	unlock_kernel();
 	return 0;
 }
 
Index: linux/drivers/serial/serial_core.c
===================================================================
--- linux.orig/drivers/serial/serial_core.c
+++ linux/drivers/serial/serial_core.c
@@ -1241,7 +1241,7 @@ static void uart_close(struct tty_struct
 	struct uart_state *state = tty->driver_data;
 	struct uart_port *port;
 
-	BUG_ON(!kernel_locked());
+//	BUG_ON(!kernel_locked());
 
 	if (!state || !state->port)
 		return;


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 17:39         ` Ingo Molnar
@ 2008-05-07  6:49           ` Zhang, Yanmin
  0 siblings, 0 replies; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-07  6:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Matthew Wilcox, J. Bruce Fields, LKML,
	Alexander Viro, Linus Torvalds, linux-fsdevel


On Tue, 2008-05-06 at 19:39 +0200, Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > Finally: how come we regressed by swapping the semaphore 
> > implementation anyway?  We went from one sleeping lock implementation 
> > to another - I'd have expected performance to be pretty much the same.
> i.e. we'll always keep yet another task in flight. This can mask wakeup 
> latencies, especially when the wakeup takes time.
> 
> The patch (hack) below tries to emulate this weirdness - it 'kicks' 
> another task as well and keeps it busy. Most of the time this just 
> causes extra scheduling, but if AIM7 is _just_ saturating the number of 
> CPUs, it might make a difference. Yanmin, does the patch below make any 
> difference to the AIM7 results?
I tested it on my 8-core stoakley and the result is 12% worse than that of
pure 2.6.26-rc1.

-yanmin

> 
> ( it would be useful data to get a meaningful context switch trace from 
>   the whole regressed workload, and compare it to a context switch trace 
>   with the revert added. )
> 
> 	Ingo
> 
> ---
>  kernel/semaphore.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> Index: linux/kernel/semaphore.c
> ===================================================================
> --- linux.orig/kernel/semaphore.c
> +++ linux/kernel/semaphore.c
> @@ -261,4 +261,14 @@ static noinline void __sched __up(struct
>  	list_del(&waiter->list);
>  	waiter->up = 1;
>  	wake_up_process(waiter->task);
> +
> +	if (likely(list_empty(&sem->wait_list)))
> +		return;
> +	/*
> +	 * Opportunistically wake up another task as well but do not
> +	 * remove it from the list:
> +	 */
> +	waiter = list_first_entry(&sem->wait_list,
> +				  struct semaphore_waiter, list);
> +	wake_up_process(waiter->task);
>  }


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07  6:28         ` Ingo Molnar
@ 2008-05-07  7:05           ` Zhang, Yanmin
  0 siblings, 0 replies; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-07  7:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton


On Wed, 2008-05-07 at 08:28 +0200, Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > 3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
> > 
> > that's one often-forgotten BKL site: about 1000 ioctls are still 
> > running under the BKL. The TTY one is hurting the most. [...]
> 
> although it's an unlocked_ioctl() now in 2.6.26, so all the BKL locking 
> has been nicely pushed down to deep inside the tty code.
> 
> > [...] To make sure it's only that BKL acquire/release that hurts, 
> > could you try the hack patch below, does it make any difference to 
> > performance?
> 
> if you use a serial console you will need the updated patch below.
I tested it on my 8-core stoakley. The result is 4% worse than that of pure
2.6.26-rc1. Still not good.

> 
> 	Ingo
> 
> ---------------------->
> Subject: no: tty bkl
> From: Ingo Molnar <mingo@elte.hu>
> Date: Wed May 07 08:21:22 CEST 2008
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  drivers/char/tty_io.c        |    5 +++--
>  drivers/serial/serial_core.c |    2 +-
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> Index: linux/drivers/char/tty_io.c
> ===================================================================
> --- linux.orig/drivers/char/tty_io.c
> +++ linux/drivers/char/tty_io.c
> @@ -2844,9 +2844,10 @@ out:
>  
>  static int tty_release(struct inode *inode, struct file *filp)
>  {
> -	lock_kernel();
> +	/* DANGEROUS - can crash your kernel! */
> +//	lock_kernel();
>  	release_dev(filp);
> -	unlock_kernel();
> +//	unlock_kernel();
>  	return 0;
>  }
>  
> Index: linux/drivers/serial/serial_core.c
> ===================================================================
> --- linux.orig/drivers/serial/serial_core.c
> +++ linux/drivers/serial/serial_core.c
> @@ -1241,7 +1241,7 @@ static void uart_close(struct tty_struct
>  	struct uart_state *state = tty->driver_data;
>  	struct uart_port *port;
>  
> -	BUG_ON(!kernel_locked());
> +//	BUG_ON(!kernel_locked());
>  
>  	if (!state || !state->port)
>  		return;
> 


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07  3:41     ` Zhang, Yanmin
  2008-05-07  3:59       ` Andrew Morton
  2008-05-07  6:26       ` Ingo Molnar
@ 2008-05-07 11:00       ` Andi Kleen
  2008-05-07 11:46         ` Matthew Wilcox
  2008-05-07 13:59         ` Alan Cox
  2 siblings, 2 replies; 140+ messages in thread
From: Andi Kleen @ 2008-05-07 11:00 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro,
	Linus Torvalds, Andrew Morton

"Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes:
> 3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.

I have an older patchkit that introduced unlocked_fcntl for some cases. It was 
briefly in mm but then dropped. Sounds like it is worth resurrecting?

tty_* is being taken care of by Alan.

chrdev_open is more work.

-Andi (who BTW never quite understood why BKL is a semaphore now and not
a spinlock?)

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 11:00       ` Andi Kleen
@ 2008-05-07 11:46         ` Matthew Wilcox
  2008-05-07 12:21           ` Andi Kleen
  2008-05-07 13:59         ` Alan Cox
  1 sibling, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-07 11:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Linus Torvalds,
	Andrew Morton

On Wed, May 07, 2008 at 01:00:14PM +0200, Andi Kleen wrote:
> "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes:
> > 3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
> 
> I have an older patchkit that introduced unlocked_fcntl for some cases. It was 
> briefly in mm but then dropped. Sounds like it is worth resurrecting?

Not sure what you're talking about here, Andi.  The only lock_kernel in
fcntl.c is around the call to ->fasync.  And Yanmin's traces don't show
fasync as being a culprit, just the paths in locks.c
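
(That one is just a narrow wrapper around the method call - roughly

        lock_kernel();
        error = filp->f_op->fasync(fd, filp, on);
        unlock_kernel();

so in fcntl.c it is only hit when F_SETFL toggles FASYNC.)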

> tty_* is being taken care of by Alan.
> 
> chrdev_open is more work.
> 
> -Andi (who BTW never quite understood why BKL is a semaphore now and not
> a spinlock?)

See git commit 6478d8800b75253b2a934ddcb734e13ade023ad0

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 11:46         ` Matthew Wilcox
@ 2008-05-07 12:21           ` Andi Kleen
  2008-05-07 14:36             ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Andi Kleen @ 2008-05-07 12:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Linus Torvalds,
	Andrew Morton

Matthew Wilcox <matthew@wil.cx> writes:

> On Wed, May 07, 2008 at 01:00:14PM +0200, Andi Kleen wrote:
>> "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes:
>> > 3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
>> 
>> I have an older patchkit that introduced unlocked_fcntl for some cases. It was 
>> briefly in mm but then dropped. Sounds like it is worth resurrecting?
>
> Not sure what you're talking about here, Andi.  The only lock_kernel in
> fcntl.c is around the call to ->fasync.  And Yanmin's traces don't show
> fasync as being a culprit, just the paths in locks.c

I was talking about fasync.

>> -Andi (who BTW never quite understood why BKL is a semaphore now and not
>> a spinlock?)
>
> See git commit 6478d8800b75253b2a934ddcb734e13ade023ad0

I am aware of that commit, thank you, but my comment was referring to the fact
that it came with about zero justification for why it was done. For the
left-over BKL regions, which are relatively short, surely a spinlock would be
better than a semaphore? So PREEMPT_BKL should have been removed, not !PREEMPT_BKL.

If that were done, I bet all these regressions would disappear. That said,
of course it is still better to actually fix the lock_kernel()s, but in the
short term just fixing lock_kernel again would be easier.

-Andi


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 11:00       ` Andi Kleen
  2008-05-07 11:46         ` Matthew Wilcox
@ 2008-05-07 13:59         ` Alan Cox
  1 sibling, 0 replies; 140+ messages in thread
From: Alan Cox @ 2008-05-07 13:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zhang, Yanmin, Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro,
	Linus Torvalds, Andrew Morton

On Wed, 07 May 2008 13:00:14 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes:
> > 3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
> 
> I have an older patchkit that introduced unlocked_fcntl for some cases. It was 
> briefly in mm but then dropped. Sounds like it is worth resurrecting?
> 
> tty_* is being taken care of by Alan.

The tty open/close paths are probably a few months away from dropping the
BKL.

Alan

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 14:36             ` Linus Torvalds
@ 2008-05-07 14:35               ` Alan Cox
  2008-05-07 15:00                 ` Linus Torvalds
  2008-05-07 14:57               ` Andi Kleen
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 140+ messages in thread
From: Alan Cox @ 2008-05-07 14:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML,
	Alexander Viro, Andrew Morton

> But my preferred option would indeed be just turning it back into a 
> spinlock - and screw latency and BKL preemption - and having the RT people 
> who care deeply just work on removing the BKL in the long run.

It isn't as if the RT build can't use a different lock type to the
default build.

> Is BKL preemption worth it? Sounds very dubious. Sounds even more dubious 
> when we now apparently have even more reason to aim for removing the BKL 
> rather than trying to mess around with it.

We have some horrible long-lasting BKL users left, unfortunately.

Alan

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 12:21           ` Andi Kleen
@ 2008-05-07 14:36             ` Linus Torvalds
  2008-05-07 14:35               ` Alan Cox
                                 ` (3 more replies)
  0 siblings, 4 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 14:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro,
	Andrew Morton



On Wed, 7 May 2008, Andi Kleen wrote:
> 
> I am aware of that commit, thank you, but my comment was referring to the fact
> that it came with about zero justification for why it was done. For the
> left-over BKL regions, which are relatively short, surely a spinlock would be
> better than a semaphore? So PREEMPT_BKL should have been removed, not !PREEMPT_BKL.

I do agree. 

I think turning the BKL into a semaphore was fine per se, but that was 
when semaphores were fast. 

Considering the apparent AIM regressions, we really either need to revert 
the semaphore consolidation, or we need to fix the kernel lock. And by 
"fixing", I don't mean removing it - it will happen, but it almost 
certainly won't happen for 2.6.26.

The easiest approach would seem to just turn the BKL into a mutex instead, 
which should hopefully be about as optimized as the old semaphores.

But my preferred option would indeed be just turning it back into a 
spinlock - and screw latency and BKL preemption - and having the RT people 
who care deeply just work on removing the BKL in the long run.

Is BKL preemption worth it? Sounds very dubious. Sounds even more dubious 
when we now apparently have even more reason to aim for removing the BKL 
rather than trying to mess around with it.

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 14:36             ` Linus Torvalds
  2008-05-07 14:35               ` Alan Cox
@ 2008-05-07 14:57               ` Andi Kleen
  2008-05-07 15:31                 ` Andrew Morton
  2008-05-07 15:19               ` Linus Torvalds
  2008-05-07 16:20               ` Ingo Molnar
  3 siblings, 1 reply; 140+ messages in thread
From: Andi Kleen @ 2008-05-07 14:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro,
	Andrew Morton


> I think turning the BKL into a semaphore was fine per se,

Are there really that many long-lived BKL holders? I have some doubts.
A semaphore only makes sense when the critical region is at least many
thousands of cycles, to amortize the scheduling overhead.

OK, perhaps this needs some numbers to decide.

> but that was
> when semaphores were fast. 

The semaphores should still be nearly as fast in theory, especially
for the contended case.

> Considering the apparent AIM regressions, we really either need to revert 
> the semaphore consolidation,

Or figure out what made the semaphore consolidation slower? As Ingo
pointed out earlier 40% is unlikely to be a fast path problem, but some
algorithmic problem. Surely that is fixable (even for .26)?

Perhaps we were lucky it showed so easily, not in something tricky.

-Andi

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 14:35               ` Alan Cox
@ 2008-05-07 15:00                 ` Linus Torvalds
  2008-05-07 15:02                   ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 15:00 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML,
	Alexander Viro, Andrew Morton



On Wed, 7 May 2008, Alan Cox wrote:
>
> > But my preferred option would indeed be just turning it back into a 
> > spinlock - and screw latency and BKL preemption - and having the RT people 
> > who care deeply just work on removing the BKL in the long run.
> 
> It isn't as if the RT build can't use a different lock type to the
> default build.

Well, considering just *how* bad the new BKL apparently is, I think that's 
a separate issue. The semaphore implementation is simply not worth it. At 
a minimum, it should be a mutex.

> > Is BKL preemption worth it? Sounds very dubious. Sounds even more dubious 
> > when we now apparently have even more reason to aim for removing the BKL 
> > rather than trying to mess around with it.
> 
> We have some horrible long lasting BKL users left unfortunately.

Quite frankly, maybe we _need_ to have a bad BKL for those to ever get 
fixed. As it was, people worked on trying to make the BKL behave better, 
and it was a failure. Rather than spend the effort on trying to make it 
work better (at a horrible cost), why not just say "Hell no - if you have 
issues with it, you need to work with people to get rid of the BKL 
rather than kludge around it".

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 15:00                 ` Linus Torvalds
@ 2008-05-07 15:02                   ` Linus Torvalds
  0 siblings, 0 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 15:02 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML,
	Alexander Viro, Andrew Morton



On Wed, 7 May 2008, Linus Torvalds wrote:
> 
> Quite frankly, maybe we _need_ to have a bad BKL for those to ever get 
> fixed. As it was, people worked on trying to make the BKL behave better, 
> and it was a failure. Rather than spend the effort on trying to make it 
> work better (at a horrible cost), why not just say "Hell no - if you have 
> issues with it, you need to work with people to get rid of the BKL 
> rather than kludge around it".

Put another way: if we had introduced the BKL-as-semaphore with a known 
40% performance drop in AIM7, I would simply never ever have accepted the 
patch in the first place, regardless of _any_ excuses. 

Performance is a feature too.

Now, the fact that the code is already merged should not be an excuse once
it has been shown to be bad. It's not a valid excuse to say "but we 
already merged it, so we can't unmerge it". We sure as hell _can_ unmerge 
it. 

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 14:36             ` Linus Torvalds
  2008-05-07 14:35               ` Alan Cox
  2008-05-07 14:57               ` Andi Kleen
@ 2008-05-07 15:19               ` Linus Torvalds
  2008-05-07 17:14                 ` Ingo Molnar
  2008-05-08  2:44                 ` Zhang, Yanmin
  2008-05-07 16:20               ` Ingo Molnar
  3 siblings, 2 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 15:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro,
	Andrew Morton



On Wed, 7 May 2008, Linus Torvalds wrote:
> 
> But my preferred option would indeed be just turning it back into a 
> spinlock - and screw latency and BKL preemption - and having the RT people 
> who care deeply just work on removing the BKL in the long run.

Here's a trial balloon patch to do that.

Yanmin - this is not well tested, but the code is fairly obvious, and it 
would be interesting to hear if this fixes the performance regression. 
Because if it doesn't, then it's not the BKL, or something totally 
different is going on.

Now, we should probably also test just converting the thing to a mutex, 
to see if that perhaps also fixes it.

			Linus

---
 arch/mn10300/Kconfig    |   11 ----
 include/linux/hardirq.h |   18 ++++---
 kernel/sched.c          |   27 ++---------
 lib/kernel_lock.c       |  120 +++++++++++++++++++++++++++++++---------------
 4 files changed, 95 insertions(+), 81 deletions(-)

diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig
index 6a6409a..e856218 100644
--- a/arch/mn10300/Kconfig
+++ b/arch/mn10300/Kconfig
@@ -186,17 +186,6 @@ config PREEMPT
 	  Say Y here if you are building a kernel for a desktop, embedded
 	  or real-time system.  Say N if you are unsure.
 
-config PREEMPT_BKL
-	bool "Preempt The Big Kernel Lock"
-	depends on PREEMPT
-	default y
-	help
-	  This option reduces the latency of the kernel by making the
-	  big kernel lock preemptible.
-
-	  Say Y here if you are building a kernel for a desktop system.
-	  Say N if you are unsure.
-
 config MN10300_CURRENT_IN_E2
 	bool "Hold current task address in E2 register"
 	default y
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 897f723..181006c 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -72,6 +72,14 @@
 #define in_softirq()		(softirq_count())
 #define in_interrupt()		(irq_count())
 
+#if defined(CONFIG_PREEMPT)
+# define PREEMPT_INATOMIC_BASE kernel_locked()
+# define PREEMPT_CHECK_OFFSET 1
+#else
+# define PREEMPT_INATOMIC_BASE 0
+# define PREEMPT_CHECK_OFFSET 0
+#endif
+
 /*
  * Are we running in atomic context?  WARNING: this macro cannot
  * always detect atomic context; in particular, it cannot know about
@@ -79,17 +87,11 @@
  * used in the general case to determine whether sleeping is possible.
  * Do not use in_atomic() in driver code.
  */
-#define in_atomic()		((preempt_count() & ~PREEMPT_ACTIVE) != 0)
-
-#ifdef CONFIG_PREEMPT
-# define PREEMPT_CHECK_OFFSET 1
-#else
-# define PREEMPT_CHECK_OFFSET 0
-#endif
+#define in_atomic()	((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_INATOMIC_BASE)
 
 /*
  * Check whether we were atomic before we did preempt_disable():
- * (used by the scheduler)
+ * (used by the scheduler, *after* releasing the kernel lock)
  */
 #define in_atomic_preempt_off() \
 		((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)
diff --git a/kernel/sched.c b/kernel/sched.c
index 58fb8af..c51b656 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4567,8 +4567,6 @@ EXPORT_SYMBOL(schedule);
 asmlinkage void __sched preempt_schedule(void)
 {
 	struct thread_info *ti = current_thread_info();
-	struct task_struct *task = current;
-	int saved_lock_depth;
 
 	/*
 	 * If there is a non-zero preempt_count or interrupts are disabled,
@@ -4579,16 +4577,7 @@ asmlinkage void __sched preempt_schedule(void)
 
 	do {
 		add_preempt_count(PREEMPT_ACTIVE);
-
-		/*
-		 * We keep the big kernel semaphore locked, but we
-		 * clear ->lock_depth so that schedule() doesnt
-		 * auto-release the semaphore:
-		 */
-		saved_lock_depth = task->lock_depth;
-		task->lock_depth = -1;
 		schedule();
-		task->lock_depth = saved_lock_depth;
 		sub_preempt_count(PREEMPT_ACTIVE);
 
 		/*
@@ -4609,26 +4598,15 @@ EXPORT_SYMBOL(preempt_schedule);
 asmlinkage void __sched preempt_schedule_irq(void)
 {
 	struct thread_info *ti = current_thread_info();
-	struct task_struct *task = current;
-	int saved_lock_depth;
 
 	/* Catch callers which need to be fixed */
 	BUG_ON(ti->preempt_count || !irqs_disabled());
 
 	do {
 		add_preempt_count(PREEMPT_ACTIVE);
-
-		/*
-		 * We keep the big kernel semaphore locked, but we
-		 * clear ->lock_depth so that schedule() doesnt
-		 * auto-release the semaphore:
-		 */
-		saved_lock_depth = task->lock_depth;
-		task->lock_depth = -1;
 		local_irq_enable();
 		schedule();
 		local_irq_disable();
-		task->lock_depth = saved_lock_depth;
 		sub_preempt_count(PREEMPT_ACTIVE);
 
 		/*
@@ -5853,8 +5831,11 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
 	spin_unlock_irqrestore(&rq->lock, flags);
 
 	/* Set the preempt count _outside_ the spinlocks! */
+#if defined(CONFIG_PREEMPT)
+	task_thread_info(idle)->preempt_count = (idle->lock_depth >= 0);
+#else
 	task_thread_info(idle)->preempt_count = 0;
-
+#endif
 	/*
 	 * The idle tasks have their own, simple scheduling class:
 	 */
diff --git a/lib/kernel_lock.c b/lib/kernel_lock.c
index cd3e825..06722aa 100644
--- a/lib/kernel_lock.c
+++ b/lib/kernel_lock.c
@@ -11,79 +11,121 @@
 #include <linux/semaphore.h>
 
 /*
- * The 'big kernel semaphore'
+ * The 'big kernel lock'
  *
- * This mutex is taken and released recursively by lock_kernel()
+ * This spinlock is taken and released recursively by lock_kernel()
  * and unlock_kernel().  It is transparently dropped and reacquired
  * over schedule().  It is used to protect legacy code that hasn't
  * been migrated to a proper locking design yet.
  *
- * Note: code locked by this semaphore will only be serialized against
- * other code using the same locking facility. The code guarantees that
- * the task remains on the same CPU.
- *
  * Don't use in new code.
  */
-static DECLARE_MUTEX(kernel_sem);
+static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(kernel_flag);
+
 
 /*
- * Re-acquire the kernel semaphore.
+ * Acquire/release the underlying lock from the scheduler.
  *
- * This function is called with preemption off.
+ * This is called with preemption disabled, and should
+ * return an error value if it cannot get the lock and
+ * TIF_NEED_RESCHED gets set.
  *
- * We are executing in schedule() so the code must be extremely careful
- * about recursion, both due to the down() and due to the enabling of
- * preemption. schedule() will re-check the preemption flag after
- * reacquiring the semaphore.
+ * If it successfully gets the lock, it should increment
+ * the preemption count like any spinlock does.
+ *
+ * (This works on UP too - _raw_spin_trylock will never
+ * return false in that case)
  */
 int __lockfunc __reacquire_kernel_lock(void)
 {
-	struct task_struct *task = current;
-	int saved_lock_depth = task->lock_depth;
-
-	BUG_ON(saved_lock_depth < 0);
-
-	task->lock_depth = -1;
-	preempt_enable_no_resched();
-
-	down(&kernel_sem);
-
+	while (!_raw_spin_trylock(&kernel_flag)) {
+		if (test_thread_flag(TIF_NEED_RESCHED))
+			return -EAGAIN;
+		cpu_relax();
+	}
 	preempt_disable();
-	task->lock_depth = saved_lock_depth;
-
 	return 0;
 }
 
 void __lockfunc __release_kernel_lock(void)
 {
-	up(&kernel_sem);
+	_raw_spin_unlock(&kernel_flag);
+	preempt_enable_no_resched();
 }
 
 /*
- * Getting the big kernel semaphore.
+ * These are the BKL spinlocks - we try to be polite about preemption. 
+ * If SMP is not on (ie UP preemption), this all goes away because the
+ * _raw_spin_trylock() will always succeed.
  */
-void __lockfunc lock_kernel(void)
+#ifdef CONFIG_PREEMPT
+static inline void __lock_kernel(void)
 {
-	struct task_struct *task = current;
-	int depth = task->lock_depth + 1;
+	preempt_disable();
+	if (unlikely(!_raw_spin_trylock(&kernel_flag))) {
+		/*
+		 * If preemption was disabled even before this
+		 * was called, there's nothing we can be polite
+		 * about - just spin.
+		 */
+		if (preempt_count() > 1) {
+			_raw_spin_lock(&kernel_flag);
+			return;
+		}
 
-	if (likely(!depth))
 		/*
-		 * No recursion worries - we set up lock_depth _after_
+		 * Otherwise, let's wait for the kernel lock
+		 * with preemption enabled..
 		 */
-		down(&kernel_sem);
+		do {
+			preempt_enable();
+			while (spin_is_locked(&kernel_flag))
+				cpu_relax();
+			preempt_disable();
+		} while (!_raw_spin_trylock(&kernel_flag));
+	}
+}
 
-	task->lock_depth = depth;
+#else
+
+/*
+ * Non-preemption case - just get the spinlock
+ */
+static inline void __lock_kernel(void)
+{
+	_raw_spin_lock(&kernel_flag);
 }
+#endif
 
-void __lockfunc unlock_kernel(void)
+static inline void __unlock_kernel(void)
 {
-	struct task_struct *task = current;
+	/*
+	 * the BKL is not covered by lockdep, so we open-code the
+	 * unlocking sequence (and thus avoid the dep-chain ops):
+	 */
+	_raw_spin_unlock(&kernel_flag);
+	preempt_enable();
+}
 
-	BUG_ON(task->lock_depth < 0);
+/*
+ * Getting the big kernel lock.
+ *
+ * This cannot happen asynchronously, so we only need to
+ * worry about other CPU's.
+ */
+void __lockfunc lock_kernel(void)
+{
+	int depth = current->lock_depth+1;
+	if (likely(!depth))
+		__lock_kernel();
+	current->lock_depth = depth;
+}
 
-	if (likely(--task->lock_depth < 0))
-		up(&kernel_sem);
+void __lockfunc unlock_kernel(void)
+{
+	BUG_ON(current->lock_depth < 0);
+	if (likely(--current->lock_depth < 0))
+		__unlock_kernel();
 }
 
 EXPORT_SYMBOL(lock_kernel);

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 14:57               ` Andi Kleen
@ 2008-05-07 15:31                 ` Andrew Morton
  2008-05-07 16:22                   ` Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Andrew Morton @ 2008-05-07 15:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML,
	Alexander Viro

On Wed, 07 May 2008 16:57:52 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> Or figure out what made the semaphore consolidation slower? As Ingo
> pointed out earlier 40% is unlikely to be a fast path problem, but some
> algorithmic problem. Surely that is fixable (even for .26)?

Absolutely.  Yanmin is apparently showing that each call to __down()
results in 1,451 calls to schedule().  wtf?


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 14:36             ` Linus Torvalds
                                 ` (2 preceding siblings ...)
  2008-05-07 15:19               ` Linus Torvalds
@ 2008-05-07 16:20               ` Ingo Molnar
  2008-05-07 16:35                 ` Linus Torvalds
  3 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07 16:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro,
	Andrew Morton


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I think turning the BKL into a semaphore was fine per se, but that was 
> when semaphores were fast.

hm, do we know it for a fact that the 40% AIM regression is due to the 
fastpath overhead of the BKL? It would be extraordinary if so.

I think it is far more likely that it's due to the different scheduling 
and wakeup behavior of the new kernel/semaphore.c code. So the fix would 
be to restore the old scheduling behavior - that's what Yanmin's manual 
revert did and that's what got him back the previous AIM7 performance.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 15:31                 ` Andrew Morton
@ 2008-05-07 16:22                   ` Matthew Wilcox
  0 siblings, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-07 16:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Linus Torvalds, Zhang, Yanmin, Ingo Molnar, LKML,
	Alexander Viro

On Wed, May 07, 2008 at 08:31:05AM -0700, Andrew Morton wrote:
> On Wed, 07 May 2008 16:57:52 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> 
> > Or figure out what made the semaphore consolidation slower? As Ingo
> > pointed out earlier 40% is unlikely to be a fast path problem, but some
> > algorithmic problem. Surely that is fixable (even for .26)?
> 
> Absolutely.  Yanmin is apparently showing that each call to __down()
> results in 1,451 calls to schedule().  wtf?

I can't figure it out either.  Unless schedule() is broken somehow ...
but that should have shown up with semaphore-sleepers.c, shouldn't it?

One other difference between semaphore-sleepers and the new generic code
is that in effect, semaphore-sleepers does a little bit of spinning
before it sleeps.  That is, if up() and down() are called more-or-less
simultaneously, the increment of sem->count will happen before __down
calls schedule().  How about something like this:

diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5c2942e..ef83f5a 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -211,6 +211,7 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
 	waiter.up = 0;
 
 	for (;;) {
+		int i;
 		if (state == TASK_INTERRUPTIBLE && signal_pending(task))
 			goto interrupted;
 		if (state == TASK_KILLABLE && fatal_signal_pending(task))
@@ -219,7 +220,15 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
 			goto timed_out;
 		__set_task_state(task, state);
 		spin_unlock_irq(&sem->lock);
+
+		for (i = 0; i < 10; i++) {
+			if (waiter.up)
+				goto skip_schedule;
+			cpu_relax();
+		}
+
 		timeout = schedule_timeout(timeout);
+ skip_schedule:
 		spin_lock_irq(&sem->lock);
 		if (waiter.up)
 			return 0;

Maybe it'd be enough to test it once ... or maybe we should use
spin_is_locked() ... Ingo?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 16:20               ` Ingo Molnar
@ 2008-05-07 16:35                 ` Linus Torvalds
  2008-05-07 17:05                   ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 16:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro,
	Andrew Morton



On Wed, 7 May 2008, Ingo Molnar wrote:
> 
> I think it is far more likely that it's due to the different scheduling 
> and wakeup behavior of the new kernel/semaphore.c code. So the fix would 
> be to restore the old scheduling behavior - that's what Yanmin's manual 
> revert did and that's what got him back the previous AIM7 performance.

Yes, Yanmin's manual revert got rid of the new semaphores entirely. Which 
was what, 7500 lines of code removed that got reverted.

And the *WHOLE* and *ONLY* excuse for dropping the spinlock lock_kernel 
was this (and I quote your message):

    remove the !PREEMPT_BKL code.
    
    this removes 160 lines of legacy code.

in other words, your only stated valid reason for getting rid of the 
spinlock was 160 lines, and the comment didn't even match what it did (it 
removed the spinlocks entirely, not just the preemptible version).

In contrast, the revert adds 7500 lines. If you go by the only documented 
reason for the crap that is the current BKL, then I know which one I'll 
take. I'll take the spinlock back, and I'd rather put preemption back 
than ever take those semaphores.

And even that's ignoring another issue: did anybody ever even do that AIM7 
benchmark comparing spinlocks to the semaphore-BKL? It's quite possible 
that the semaphores (even the well-behaved ones) behaved worse than the 
spinlocks.

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 17:21       ` Andrew Morton
                           ` (2 preceding siblings ...)
  2008-05-06 17:45         ` Linus Torvalds
@ 2008-05-07 16:38         ` Matthew Wilcox
  2008-05-07 16:55           ` Linus Torvalds
  3 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-07 16:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, Linus Torvalds, linux-fsdevel

On Tue, May 06, 2008 at 10:21:53AM -0700, Andrew Morton wrote:
> up() seems to be doing wake-one, FIFO which is nice.  Did the
> implementation which we just removed also do that?  Was it perhaps
> accidentally doing LIFO or something like that?

If heavily contended, it could do this.

up() would increment sem->count and call __up(), which would call wake_up();
down() would decrement sem->count.
The unlucky task woken by __up() would lose the race and go back to sleep.
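
In rough pseudo-C, the old optimized semaphore fastpath was a single atomic
op per operation (a simplified sketch, not the real per-arch asm):

        void down(struct semaphore *sem)
        {
                if (atomic_dec_return(&sem->count) < 0)
                        __down(sem);            /* contended: go to sleep */
        }

        void up(struct semaphore *sem)
        {
                if (atomic_inc_return(&sem->count) <= 0)
                        __up(sem);              /* waiters exist: wake one */
        }

A racing down() that decrements the count right after up() incremented it
takes the semaphore before the woken sleeper ever runs, so the sleeper finds
it gone and goes back to sleep.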

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 16:38         ` Matthew Wilcox
@ 2008-05-07 16:55           ` Linus Torvalds
  2008-05-07 17:08             ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 16:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, linux-fsdevel



On Wed, 7 May 2008, Matthew Wilcox wrote:
> 
> If heavily contended, it could do this.

It doesn't have to be heavily contended - if it's just hot and a bit lucky, 
it would potentially never schedule at all, because it would never take 
the spinlock and serialize the callers.

It doesn't even need "unfairness" to work that way. The old semaphore 
implementation was very much designed to be lock-free, and if you had one 
CPU doing a lock while another did an unlock, the *common* situation was 
that the unlock would succeed first, because the unlocker was also the 
person who had the spinlock exclusively in its cache!

The above may count as "lucky", but the hot-cache-line thing is a big 
deal. It likely "lucky" into something that isn't a 50:50 chance, but 
something that is quite possible to trigger consistently if you just have 
mostly short holders of the lock.

Which, btw, is probably true. The BKL is normally held for short times, 
and released (by that thread) for relatively much longer times. Which 
is when spinlocks tend to work the best, even when they are fair (because 
it's not so much a fairness issue, it's simply a cost-of-taking-the-lock 
issue!)

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 16:35                 ` Linus Torvalds
@ 2008-05-07 17:05                   ` Ingo Molnar
  2008-05-07 17:24                     ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro,
	Andrew Morton


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > I think it is far more likely that it's due to the different 
> > scheduling and wakeup behavior of the new kernel/semaphore.c code. 
> > So the fix would be to restore the old scheduling behavior - that's 
> > what Yanmin's manual revert did and that's what got him back the 
> > previous AIM7 performance.
> 
> Yes, Yanmin's manual revert got rid of the new semaphores entirely. 
> Which was what, 7500 lines of code removed that got reverted.

i wouldn't advocate a 7500-line revert instead of a 160-line change.

my suggestion was that the scheduling behavior of the new 
kernel/semaphore.c code is causing the problem - i.e. making it match 
the old semaphore code's behavior would give us back performance.

> And the *WHOLE* and *ONLY* excuse for dropping the spinlock 
> lock_kernel was this (and I quote your message):
> 
>     remove the !PREEMPT_BKL code.
>     
>     this removes 160 lines of legacy code.
> 
> in other words, your only stated valid reason for getting rid of the 
> spinlock was 160 lines, and the comment didn't even match what it did 
> (it removed the spinlocks entirely, not just the preemptible version).

it was removed by me in the course of this discussion:

   http://lkml.org/lkml/2008/1/2/58

the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the 
spinlock version] was broken for a longer period of time (it crashed 
trivially), because nobody apparently used it. People (Nick) asked why 
it was still there and i agreed and removed it. CONFIG_PREEMPT_BKL=y was 
the default, that was what all distros used. I.e. the spinlock code was 
in essence dead code at that point in time.

the spinlock code might in fact perform _better_, but nobody came up 
with such a workload before.

> In contrast, the revert adds 7500 lines. If you go by the only 
> documented reason for the crap that is the current BKL, then I know 
> which one I'll take. I'll take the spinlock back, and I'd rather put 
> preemption back than ever take those semaphores.
> 
> And even that's ignoring another issue: did anybody ever even do that 
> AIM7 benchmark comparing spinlocks to the semaphore-BKL? It's quite 
> possible that the semaphores (even the well-behaved ones) behaved 
> worse than the spinlocks.

that's a good question...

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 16:55           ` Linus Torvalds
@ 2008-05-07 17:08             ` Linus Torvalds
  2008-05-07 17:16               ` Andrew Morton
  2008-05-07 17:22               ` Ingo Molnar
  0 siblings, 2 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, linux-fsdevel



On Wed, 7 May 2008, Linus Torvalds wrote:
>
> Which, btw, is probably true. The BKL is normally held for short times, 
> and released (by that thread) for relatively much longer times. Which 
> is when spinlocks tend to work the best, even when they are fair (because 
> it's not so much a fairness issue, it's simply a cost-of-taking-the-lock 
> issue!)

.. and don't get me wrong: the old semaphores (and the new mutexes) should 
also have this property when lucky: taking the lock is often a hot-path 
case.

And the spinlock+generic semaphore thing probably makes that "lucky" 
behavior be exponentially less likely, because now to hit the lucky case, 
rather than the hot path having just *one* access to the interesting cache 
line, it has basically something like 4 accesses (spinlock, count test, 
count decrement, spinunlock), in addition to various serializing 
instructions, so I suspect it quite often gets serialized simply because 
even the "fast path" is actually about ten times as long!

As a result, a slow "fast path" means that the thing gets saturated much 
more easily, and that in turn means that the "fast path" turns into a 
"slow path" more easily, which is how you end up in the scheduler rather 
than just taking the fast path.

This is why sleeping locks are more expensive in general: they have a 
*huge* cost from when they get contended. Hundreds of times higher than a 
spinlock. And the faster they are, the longer it takes for them to get 
contended under load. So slowing them down in the fast path is a double 
whammy, in that it shows their bad behaviour much earlier.

And the generic semaphores really are slower than the old optimized ones 
in that fast path. By a *big* amount.
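
For reference, the generic fastpath is roughly this (the kernel/semaphore.c
down(), slightly simplified):

        void down(struct semaphore *sem)
        {
                unsigned long flags;

                spin_lock_irqsave(&sem->lock, flags);
                if (likely(sem->count > 0))
                        sem->count--;           /* uncontended: got it */
                else
                        __down(sem);            /* contended: sleep */
                spin_unlock_irqrestore(&sem->lock, flags);
        }

i.e. a spinlock acquire, a count test, a count decrement and a spinlock
release where the old fastpath was a single locked instruction.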

Which is why I'm 100% convinced it's not even worth saving the old code. 
It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with 
"slow path" other than the fact that it gets to that slow path much more 
these days.

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 15:19               ` Linus Torvalds
@ 2008-05-07 17:14                 ` Ingo Molnar
  2008-05-08  2:44                 ` Zhang, Yanmin
  1 sibling, 0 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro,
	Andrew Morton


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > But my preferred option would indeed be just turning it back into a 
> > spinlock - and screw latency and BKL preemption - and having the RT 
> > people who care deeply just work on removing the BKL in the long 
> > run.
> 
> Here's a trial balloon patch to do that.

here's a simpler trial-balloon test-patch (well, hack) that is also 
reasonably well tested. It turns the BKL into a "spin-semaphore". If 
this resolves the performance problem then it's all due to the BKL's 
scheduling/preemption properties.

this approach is ugly (it's just a more expensive spinlock), but has an 
advantage: the code logic is obviously correct, and it would also make 
it much easier later on to turn the BKL back into a sleeping lock again 
- once the TTY code's BKL use is fixed. (i think Alan said it might 
happen in the next few months) The BKL is more expensive than a simple 
spinlock anyway.

	Ingo

------------->
Subject: BKL: spin on acquire
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 19:05:40 CEST 2008

NOT-Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 lib/kernel_lock.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux/lib/kernel_lock.c
===================================================================
--- linux.orig/lib/kernel_lock.c
+++ linux/lib/kernel_lock.c
@@ -46,7 +46,8 @@ int __lockfunc __reacquire_kernel_lock(v
 	task->lock_depth = -1;
 	preempt_enable_no_resched();
 
-	down(&kernel_sem);
+	while (down_trylock(&kernel_sem))
+		cpu_relax();
 
 	preempt_disable();
 	task->lock_depth = saved_lock_depth;
@@ -67,11 +68,13 @@ void __lockfunc lock_kernel(void)
 	struct task_struct *task = current;
 	int depth = task->lock_depth + 1;
 
-	if (likely(!depth))
+	if (likely(!depth)) {
 		/*
 		 * No recursion worries - we set up lock_depth _after_
 		 */
-		down(&kernel_sem);
+		while (down_trylock(&kernel_sem))
+			cpu_relax();
+	}
 
 	task->lock_depth = depth;
 }

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:08             ` Linus Torvalds
@ 2008-05-07 17:16               ` Andrew Morton
  2008-05-07 17:27                 ` Linus Torvalds
  2008-05-07 17:22               ` Ingo Molnar
  1 sibling, 1 reply; 140+ messages in thread
From: Andrew Morton @ 2008-05-07 17:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel

On Wed, 7 May 2008 10:08:18 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Which is why I'm 100% convinced it's not even worth saving the old code. 
> It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with 
> "slow path" other than the fact that it gets to that slow path much more 
> these days.

Stupid question: why doesn't lock_kernel() use a mutex?

(stupid answer: it'll trigger might_sleep() checks when we do it early in
boot with irqs disabled, but we can fix that)

(And __might_sleep()'s system_state check might even save us from that)
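
(The check in __might_sleep() is roughly

        if ((in_atomic() || irqs_disabled()) &&
            system_state == SYSTEM_RUNNING && !oops_in_progress)
                printk(KERN_ERR "BUG: sleeping function called from invalid "
                                "context ...");

so early-boot callers with irqs disabled wouldn't warn anyway.)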

Of course, we shouldn't change anything until we've worked out why the new
semaphores got slower.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:08             ` Linus Torvalds
  2008-05-07 17:16               ` Andrew Morton
@ 2008-05-07 17:22               ` Ingo Molnar
  2008-05-07 17:25                 ` Ingo Molnar
  2008-05-07 17:31                 ` Linus Torvalds
  1 sibling, 2 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Which is why I'm 100% convinced it's not even worth saving the old 
> code. It needs to use mutexes, or spinlocks. I bet it has *nothing* to 
> do with "slow path" other than the fact that it gets to that slow path 
> much more these days.

i think your theory should be easy to test: Yanmin, could you turn on 
CONFIG_MUTEX_DEBUG=y and check by how much AIM7 regresses?

Because in the CONFIG_MUTEX_DEBUG=y case the mutex debug code does 
exactly that: it doesnt use the single-instruction fastpath [it uses 
asm-generic/mutex-null.h] but always drops into the slowpath (to be able 
to access debug state). That debug code is about as expensive as the 
generic semaphore code's current fastpath. (perhaps even more 
expensive.)
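
(asm-generic/mutex-null.h just routes every fastpath call straight into the
slowpath - roughly:

        #define __mutex_fastpath_lock(count, fail_fn)           fail_fn(count)
        #define __mutex_fastpath_lock_retval(count, fail_fn)    fail_fn(count)
        #define __mutex_fastpath_unlock(count, fail_fn)         fail_fn(count)
        #define __mutex_fastpath_trylock(count, fail_fn)        fail_fn(count)

so there is no single-instruction fastpath at all in that config.)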

There's far more normal mutex fastpath use during an AIM7 run than any 
BKL use. So if it's due to any direct fastpath overhead and the 
resulting widening of the window for the real slowdown, we should see a 
severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:05                   ` Ingo Molnar
@ 2008-05-07 17:24                     ` Linus Torvalds
  2008-05-07 17:36                       ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro,
	Andrew Morton



On Wed, 7 May 2008, Ingo Molnar wrote:
> 
> it was removed by me in the course of this discussion:
> 
>    http://lkml.org/lkml/2008/1/2/58
> 
> the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the 
> spinlock version] was broken for a longer period of time (it crashed 
> trivially), because nobody apparently used it.

Hmm. I've generally used PREEMPT_NONE, and always thought PREEMPT_BKL was 
the known-flaky one.

The thread you point to also says that it's PREEMPT_BKL=y that was the 
problem (ie "I've seen 1s+ desktop latencies due to PREEMPT_BKL when I was 
still using reiserfs."), not the plain spinlock approach.

But it would definitely be interesting to see the crash reports. And the 
help message always said "Say N if you are unsure." even if it ended up 
being marked 'y' by default at some point (and then in January was made 
first unconditional, and then removed entirely)

Because in many ways, the non-preempt BKL is the *much* simpler case. I 
don't see why it would crash - it just turns the BKL into a trivial 
counting spinlock that can sleep.

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:22               ` Ingo Molnar
@ 2008-05-07 17:25                 ` Ingo Molnar
  2008-05-07 17:31                 ` Linus Torvalds
  1 sibling, 0 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel


* Ingo Molnar <mingo@elte.hu> wrote:

> There's far more normal mutex fastpath use during an AIM7 run than any 
> BKL use. So if it's due to any direct fastpath overhead and the 
> resulting widening of the window for the real slowdown, we should see 
> a severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?

my own guesstimate about the AIM7 performance impact resulting from 
CONFIG_MUTEX_DEBUG=y: performance overhead will not be measurable, or 
will at most be in the sub-1% range. But i've been badly wrong before :)

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:16               ` Andrew Morton
@ 2008-05-07 17:27                 ` Linus Torvalds
  0 siblings, 0 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel



On Wed, 7 May 2008, Andrew Morton wrote:

> On Wed, 7 May 2008 10:08:18 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > Which is why I'm 100% convinced it's not even worth saving the old code. 
> > It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with 
> > "slow path" other than the fact that it gets to that slow path much more 
> > these days.
> 
> Stupid question: why doesn't lock_kernel() use a mutex?

Not stupid.

The only reason some code didn't get turned over to mutexes was literally 
that they didn't want the debugging because they were doing intentionally 
bad things. 

I think the BKL is one of them (the console semaphore was another, iirc). 

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:22               ` Ingo Molnar
  2008-05-07 17:25                 ` Ingo Molnar
@ 2008-05-07 17:31                 ` Linus Torvalds
  2008-05-07 17:47                   ` Linus Torvalds
  2008-05-07 17:49                   ` Ingo Molnar
  1 sibling, 2 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel



On Wed, 7 May 2008, Ingo Molnar wrote:
> 
> There's far more normal mutex fastpath use during an AIM7 run than any 
> BKL use. So if it's due to any direct fastpath overhead and the 
> resulting widening of the window for the real slowdown, we should see a 
> severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?

Not agreed.

The BKL is special because it is a *single* lock.

All the "normal" mutex code use fine-grained locking, so even if you slow 
down the fast path, that won't cause the same kind of fastpath->slowpath 
increase.

In order to see the fastpath->slowpath thing, you do need to have many 
threads hitting the same lock: ie the slowdown has to result in real 
contention. 

Almost no mutexes have any potential for contention what-so-ever, except 
for things that very consciously try to hit it (multiple threads doing 
readdir and/or file creation on the *same* directory etc).

The BKL really is special. 

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:24                     ` Linus Torvalds
@ 2008-05-07 17:36                       ` Ingo Molnar
  2008-05-07 17:55                         ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro,
	Andrew Morton


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Wed, 7 May 2008, Ingo Molnar wrote:
> > 
> > it was removed by me in the course of this discussion:
> > 
> >    http://lkml.org/lkml/2008/1/2/58
> > 
> > the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the 
> > spinlock version] was broken for a longer period of time (it crashed 
> > trivially), because nobody apparently used it.
> 
> Hmm. I've generally used PREEMPT_NONE, and always thought PREEMPT_BKL 
> was the known-flaky one.
> 
> The thread you point to also says that it's PREEMPT_BKL=y that was the 
> problem (ie "I've seen 1s+ desktop latencies due to PREEMPT_BKL when I 
> was still using reiserfs."), not the plain spinlock approach.

no, there was another problem (which i couldn't immediately find because 
lkml.org only indexes part of the threads, i'll research it some more), 
which was some cond_resched() thing in the !PREEMPT_BKL case.

> But it would definitely be interesting to see the crash reports. And 
> the help message always said "Say N if you are unsure." even if it 
> ended up being marked 'y' by default at some point (and then in 
> January was made first unconditional, and then removed entirely)
> 
> Because in many ways, the non-preempt BKL is the *much* simpler case. 
> I don't see why it would crash - it just turns the BKL into a trivial 
> counting spinlock that can sleep.

yeah. The latencies are a different problem, and indeed were reported 
against PREEMPT_BKL, and believed to be due to reiser3 and the tty code. 
(reiser3 runs almost all of its code under the BKL)

The !PREEMPT_BKL crash was some simple screwup on my part of getting 
atomicity checks wrong in cond_resched() - and it went unnoticed for a 
long time - or something like that. I'll try to find that discussion.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:31                 ` Linus Torvalds
@ 2008-05-07 17:47                   ` Linus Torvalds
  2008-05-07 17:49                   ` Ingo Molnar
  1 sibling, 0 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel



On Wed, 7 May 2008, Linus Torvalds wrote:
> 
> All the "normal" mutex code use fine-grained locking, so even if you slow 
> down the fast path, that won't cause the same kind of fastpath->slowpath 
> increase.

Put another way: let's say that the "good fastpath" is basically a single 
locked instruction - ~12 cycles on AMD, ~35 on Core 2. That's the 
no-bouncing, no-contention case.

Doing it with debugging (call overhead, spinlocks, local irq saving etc) 
will probably easily triple it or more, but we're not changing anything 
else. There's no "downstream" effect: the behaviour itself doesn't change. 
It doesn't get more bouncing, it doesn't start sleeping.

But what happens if the lock has the *potential* for conflicts is 
different.

There, a "longish pause + fast lock + short average code sequece + fast 
unlock" is quite likely to stay uncontended for a fair amount of time, and 
while it will be much slower than the no-contention-at-all case (because 
you do get a pretty likely cacheline event at the "fast lock" part), with 
a fairly low number of CPU's and a long enough pause, you *also* easily 
get into a pattern where the thing that got the lock will likely also get 
to unlock without dropping the cacheline.

So far so good.

But that basically depends on the fact that "lock + work + unlock" is 
_much_ shorter than the "longish pause" in between, so that even if you 
have <n> CPU's all doing the same thing, their pauses between the locked 
section are still bigger than <n> times that short time.

Once that is no longer true, you now start to bounce both at the lock 
*and* the unlock, and now that whole sequence got likely ten times slower. 
*AND* because it now actually has real contention, it actually got even 
worse: if the lock is a sleeping one, you get *another* order of magnitude 
just because you now started doing scheduling overhead too!

So the thing is, it just breaks down very badly. A spinlock that gets 
contention probably gets ten times slower due to bouncing the cacheline. A 
semaphore that gets contention probably gets a *hundred* times slower, or 
more.

And so my bet is that both the old and the new semaphores had the same bad 
break-down situation, but the new semaphores just are a lot easier to 
trigger it because they are at least three times costlier than the old 
ones, so you just hit the bad behaviour with much lower loads (or fewer 
number of CPU's).

But spinlocks really do behave much better when contended, because at 
least they don't get the even bigger hit of also hitting the scheduler. So 
the old semaphores would have behaved badly too *eventually*, they just 
needed a more extreme case to show that bad behavior.
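
[ To make that pattern concrete, here is a minimal user-space sketch of
  the "pause + fast lock + short work + fast unlock" loop being described
  - illustrative only; the thread count, pause length and lock type are
  arbitrary and nothing here is kernel code. Shrink PAUSE_NS or raise
  NTHREADS until NTHREADS * hold-time exceeds the pause and throughput
  collapses; swap the mutex for a pthread spinlock to compare the
  spinning vs sleeping regimes discussed above: ]

	#include <pthread.h>
	#include <stdio.h>
	#include <time.h>

	#define NTHREADS 8
	#define ITERS    100000
	#define PAUSE_NS 20000		/* the "longish pause" */

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static long shared;

	static void *worker(void *arg)
	{
		struct timespec pause = { 0, PAUSE_NS };
		int i;

		(void)arg;
		for (i = 0; i < ITERS; i++) {
			/* uncontended as long as pause >> NTHREADS * hold time */
			nanosleep(&pause, NULL);
			pthread_mutex_lock(&lock);
			shared++;		/* short critical section */
			pthread_mutex_unlock(&lock);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t[NTHREADS];
		int i;

		for (i = 0; i < NTHREADS; i++)
			pthread_create(&t[i], NULL, worker, NULL);
		for (i = 0; i < NTHREADS; i++)
			pthread_join(t[i], NULL);
		printf("total = %ld\n", shared);
		return 0;
	}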

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:31                 ` Linus Torvalds
  2008-05-07 17:47                   ` Linus Torvalds
@ 2008-05-07 17:49                   ` Ingo Molnar
  2008-05-07 18:02                     ` Linus Torvalds
  1 sibling, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > There's far more normal mutex fastpath use during an AIM7 run than 
> > any BKL use. So if it's due to any direct fastpath overhead and the 
> > resulting widening of the window for the real slowdown, we should 
> > see a severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?
> 
> Not agreed.
> 
> The BKL is special because it is a *single* lock.

ok, indeed my suggestion is wrong and this would not be a good 
comparison.

another idea: my trial-balloon patch should test your theory too, because 
the generic down_trylock() is still the 'fat' version, it does:

        spin_lock_irqsave(&sem->lock, flags);
        count = sem->count - 1;
        if (likely(count >= 0))
                sem->count = count;
        spin_unlock_irqrestore(&sem->lock, flags);

if there is a noticeable performance difference between your 
trial-ballon patch and mine, then the micro-cost of the BKL very much 
matters to this workload. Agreed about that?

but i'd be _hugely_ surprised about it. The tty code's BKL use should i 
think only happen when a task exits and releases the tty - and even if 
this is a threaded test (which AIM7 can be - not sure which exact 
parameters Yanmin used), the costs of thread creation and thread exit 
are just not in the same ballpark as any BKL micro-costs. Dunno, maybe i 
overlooked some high-freq BKL user (but any such site would have shown 
up before). Even assuming a widening of the critical path and some 
catastrophic domino effect (that does show up as increased scheduling), 
i've never seen a 40% drop like this.

this regression, to me, has "different scheduling behavior" written all 
over it - but that's just an impression. I'm not going to bet against 
you though ;-)

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:36                       ` Ingo Molnar
@ 2008-05-07 17:55                         ` Linus Torvalds
  2008-05-07 17:59                           ` Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 17:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro,
	Andrew Morton



On Wed, 7 May 2008, Ingo Molnar wrote:
> 
> no, there was another problem (which i couldn't immediately find because 
> lkml.org only indexes part of the threads, i'll research it some more), 
> which was some cond_resched() thing in the !PREEMPT_BKL case.

Hmm. I do agree that _cond_resched() looks a bit iffy, although in a safe 
way. It uses just

	!(preempt_count() & PREEMPT_ACTIVE)

to see whether it can schedule, and it should probably use in_atomic() 
which ignores the kernel lock.
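
[ Roughly, the two tests being contrasted - a sketch only, where
  preempt_count(), PREEMPT_ACTIVE and kernel_locked() are the real
  kernel symbols but the helper names are made up, and the second
  variant assumes a spinlock-style BKL that bumps preempt_count() by
  one while held: ]

	static inline int resched_allowed_today(void)
	{
		return !(preempt_count() & PREEMPT_ACTIVE);
	}

	static inline int resched_allowed_ignoring_bkl(void)
	{
		/* discount the one level the held BKL contributes */
		return (preempt_count() & ~PREEMPT_ACTIVE) ==
			(kernel_locked() ? 1 : 0);
	}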

But right now, that whole thing is disabled if PREEMPT is on anyway, so in 
effect (with my test patch, at least) cond_resched() would just be a no-op 
if PREEMPT is on, even if BKL isn't preemptable.

So it doesn't look buggy, but it looks like it might cause longer 
latencies than strictly necessary. And if somebody depends on 
cond_resched() to avoid some bad livelock situation, that would obviously 
not work (but that sounds like a fundamental bug anyway, I really hope 
nobody has ever written their code that way).

> The !PREEMPT_BKL crash was some simple screwup on my part of getting 
> atomicity checks wrong in cond_resched() - and it went unnoticed for a 
> long time - or something like that. I'll try to find that discussion.

Yes, some silly bug sounds more likely. Especially considering how many 
different cases there were (semaphores vs spinlocks vs preemptable 
spinlocks).

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:55                         ` Linus Torvalds
@ 2008-05-07 17:59                           ` Matthew Wilcox
  2008-05-07 18:17                             ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-07 17:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andi Kleen, Zhang, Yanmin, LKML, Alexander Viro,
	Andrew Morton

On Wed, May 07, 2008 at 10:55:26AM -0700, Linus Torvalds wrote:
> Hmm. I do agree that _cond_resched() looks a bit iffy, although in a safe 
> way. It uses just
> 
> 	!(preempt_count() & PREEMPT_ACTIVE)
> 
> to see whether it can schedule, and it should probably use in_atomic() 
> which ignores the kernel lock.
> 
> But right now, that whole thing is disabled if PREEMPT is on anyway, so in 
> effect (with my test patch, at least) cond_resched() would just be a no-op 
> if PREEMPT is on, even if BKL isn't preemptable.
> 
> So it doesn't look buggy, but it looks like it might cause longer 
> latencies than strictly necessary. And if somebody depends on 
> cond_resched() to avoid some bad livelock situation, that would obviously 
> not work (but that sounds like a fundamental bug anyway, I really hope 
> nobody has ever written their code that way).

Funny you should mention it; locks.c uses cond_resched() assuming that
it ignores the BKL.  Not through needing to avoid livelock, but it does
presume that other higher priority tasks contending for the lock will
get a chance to take it.  You'll notice the patch I posted yesterday
drops the file_lock_lock around the call to cond_resched().

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:49                   ` Ingo Molnar
@ 2008-05-07 18:02                     ` Linus Torvalds
  2008-05-07 18:17                       ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 18:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel



On Wed, 7 May 2008, Ingo Molnar wrote:
> 
> > another idea: my trial-balloon patch should test your theory too, because 
> the generic down_trylock() is still the 'fat' version, it does:

I agree that your trial-balloon should likely get rid of the big 
regression, since it avoids the scheduler.

So with your patch, lock_kernel() ends up being just a rather expensive 
spinlock. And yes, I'd expect that it should get rid of the 40% cost, 
because while it makes lock_kernel() more expensive than a spinlock and 
you might end up having a few more cacheline bounces on the lock due to 
that, that's still the "small" expense compared to going through the whole 
scheduler on conflicts.

So I'd expect that realistically the performance difference between your 
version and just plain spinlocks shouldn't be *that* big. I'd expect it to 
be visible, but in the (low) single-digit percentage range rather than in 
any 40% range. That's just a guess.

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 18:02                     ` Linus Torvalds
@ 2008-05-07 18:17                       ` Ingo Molnar
  2008-05-07 18:27                         ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07 18:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 7 May 2008, Ingo Molnar wrote:
> > 
> > another idea: my trial-balloon patch should test your theory too, 
> > because the generic down_trylock() is still the 'fat' version, it 
> > does:
> 
> I agree that your trial-balloon should likely get rid of the big 
> regression, since it avoids the scheduler.
> 
> So with your patch, lock_kernel() ends up being just a rather 
> expensive spinlock. And yes, I'd expect that it should get rid of the 
> 40% cost, because while it makes lock_kernel() more expensive than a 
> spinlock and you might end up having a few more cacheline bounces on 
> the lock due to that, that's still the "small" expense compared to 
> going through the whole scheduler on conflicts.
> 
> So I'd expect that realistically the performance difference between 
> your version and just plain spinlocks shouldn't be *that* big. I'd 
> expect it to be visible, but in the (low) single-digit percentage 
> range rather than in any 40% range. That's just a guess.

third attempt - the patch below, on top of v2.6.25, should add fastpath 
atomic overhead quite similar to what the generic semaphores do. So if 
Yanmin tests this patch on top of v2.6.25, we should see the direct 
fastpath overhead - without any other changes to the semaphore 
wakeup/scheduling logic.

[ this patch should in fact be a bit worse, because there's two more 
  atomics in the fastpath - the fastpath atomics of the old semaphore 
  code. ]

	Ingo

------------------>
Subject: v2.6.25 BKL: add atomic overhead
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 20:09:13 CEST 2008

---
 lib/kernel_lock.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Index: linux-2.6.25/lib/kernel_lock.c
===================================================================
--- linux-2.6.25.orig/lib/kernel_lock.c
+++ linux-2.6.25/lib/kernel_lock.c
@@ -24,6 +24,7 @@
  * Don't use in new code.
  */
 static DECLARE_MUTEX(kernel_sem);
+static DEFINE_SPINLOCK(global_lock);
 
 /*
  * Re-acquire the kernel semaphore.
@@ -47,6 +48,9 @@ int __lockfunc __reacquire_kernel_lock(v
 
 	down(&kernel_sem);
 
+	spin_lock(&global_lock);
+	spin_unlock(&global_lock);
+
 	preempt_disable();
 	task->lock_depth = saved_lock_depth;
 
@@ -55,6 +59,9 @@ int __lockfunc __reacquire_kernel_lock(v
 
 void __lockfunc __release_kernel_lock(void)
 {
+	spin_lock(&global_lock);
+	spin_unlock(&global_lock);
+
 	up(&kernel_sem);
 }
 
@@ -66,12 +73,16 @@ void __lockfunc lock_kernel(void)
 	struct task_struct *task = current;
 	int depth = task->lock_depth + 1;
 
-	if (likely(!depth))
+	if (likely(!depth)) {
 		/*
 		 * No recursion worries - we set up lock_depth _after_
 		 */
 		down(&kernel_sem);
 
+		spin_lock(&global_lock);
+		spin_unlock(&global_lock);
+	}
+
 	task->lock_depth = depth;
 }
 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 17:59                           ` Matthew Wilcox
@ 2008-05-07 18:17                             ` Linus Torvalds
  2008-05-07 18:49                               ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 18:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ingo Molnar, Andi Kleen, Zhang, Yanmin, LKML, Alexander Viro,
	Andrew Morton



On Wed, 7 May 2008, Matthew Wilcox wrote:
> > 
> > So it doesn't look buggy, but it looks like it might cause longer 
> > latencies than strictly necessary. And if somebody depends on 
> > cond_resched() to avoid some bad livelock situation, that would obviously 
> > not work (but that sounds like a fundamental bug anyway, I really hope 
> > nobody has ever written their code that way).
> 
> Funny you should mention it; locks.c uses cond_resched() assuming that
> it ignores the BKL.  Not through needing to avoid livelock, but it does
> presume that other higher priority tasks contending for the lock will
> get a chance to take it.  You'll notice the patch I posted yesterday
> drops the file_lock_lock around the call to cond_resched().

Well, this would only be noticeable with CONFIG_PREEMPT.

If you don't have preempt enabled, it looks like everything should work 
ok: the kernel lock wouldn't increase the preempt count, and 
_cond_resched() works fine.

If you're PREEMPT, then the kernel lock would increase the preempt count, 
and _cond_resched() would refuse to re-schedule it, *but* with PREEMPT 
you'd never see it *anyway*, because PREEMPT will disable cond_resched() 
entirely (because preemption takes care of normal scheduling latencies 
without it).

And I'm also sure that this all worked fine at some point, and it's 
largely a result just of the multiple different variations of BKL 
preemption coupled with some of them getting removed entirely, so the code 
that used to handle it just got corrupt over time. See commit 02b67cc3b, 
for example. 

.. Hmm ... Time passes. Linus looks at git history.

It does look like "cond_resched()" has not worked with the BKL since 2005, 
and hasn't taken the BKL into account. Commit 5bbcfd9000:

    [PATCH] cond_resched(): fix bogus might_sleep() warning

+       if (unlikely(preempt_count()))
+               return;

which talks about the BKS, ie it only took the *semaphore* implementation 
into account. Never the spinlock-with-preemption-count one.

Or am I blind?

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 18:17                       ` Ingo Molnar
@ 2008-05-07 18:27                         ` Linus Torvalds
  2008-05-07 18:43                           ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 18:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel



On Wed, 7 May 2008, Ingo Molnar wrote:
> 
> [ this patch should in fact be a bit worse, because there's two more 
>   atomics in the fastpath - the fastpath atomics of the old semaphore 
>   code. ]

Well, it doesn't have the irq stuff, which is also pretty costly. Also, it 
doesn't nest the accesses the same way (with the counts being *inside* the 
spinlock and serialized against each other), so I'm not 100% sure you'd 
get the same behaviour.

But yes, it certainly has the potential to show the same slowdown. But 
it's not a very good patch, since not showing it doesn't really prove 
much.

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 18:27                         ` Linus Torvalds
@ 2008-05-07 18:43                           ` Ingo Molnar
  2008-05-07 19:01                             ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07 18:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > [ this patch should in fact be a bit worse, because there's two more
> >   atomics in the fastpath - the fastpath atomics of the old 
> >   semaphore code. ]
> 
> Well, it doesn't have the irq stuff, which is also pretty costly. 
> Also, it doesn't nest the accesses the same way (with the counts being 
> *inside* the spinlock and serialized against each other), so I'm not 
> 100% sure you'd get the same behaviour.
> 
> But yes, it certainly has the potential to show the same slowdown. But 
> it's not a very good patch, since not showing it doesn't really prove 
> much.

ok, the one below does irq ops and the counter behavior - and because 
the critical section also has the old-semaphore atomics i think this 
should definitely be a more expensive fastpath than what the new generic 
code introduces. So if this patch produces a 40% AIM7 slowdown on 
v2.6.25 it's the fastpath overhead (and its effects on slowpath 
probability) that makes the difference.

	Ingo

------------------->
Subject: add BKL atomic overhead
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 20:09:13 CEST 2008

NOT-Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 lib/kernel_lock.c |   21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

Index: linux-2.6.25/lib/kernel_lock.c
===================================================================
--- linux-2.6.25.orig/lib/kernel_lock.c
+++ linux-2.6.25/lib/kernel_lock.c
@@ -24,6 +24,8 @@
  * Don't use in new code.
  */
 static DECLARE_MUTEX(kernel_sem);
+static int global_count;
+static DEFINE_SPINLOCK(global_lock);
 
 /*
  * Re-acquire the kernel semaphore.
@@ -39,6 +41,7 @@ int __lockfunc __reacquire_kernel_lock(v
 {
 	struct task_struct *task = current;
 	int saved_lock_depth = task->lock_depth;
+	unsigned long flags;
 
 	BUG_ON(saved_lock_depth < 0);
 
@@ -47,6 +50,10 @@ int __lockfunc __reacquire_kernel_lock(v
 
 	down(&kernel_sem);
 
+	spin_lock_irqsave(&global_lock, flags);
+	global_count++;
+	spin_unlock_irqrestore(&global_lock, flags);
+
 	preempt_disable();
 	task->lock_depth = saved_lock_depth;
 
@@ -55,6 +62,12 @@ int __lockfunc __reacquire_kernel_lock(v
 
 void __lockfunc __release_kernel_lock(void)
 {
+	unsigned long flags;
+
+	spin_lock_irqsave(&global_lock, flags);
+	global_count--;
+	spin_unlock_irqrestore(&global_lock, flags);
+
 	up(&kernel_sem);
 }
 
@@ -66,12 +79,18 @@ void __lockfunc lock_kernel(void)
 	struct task_struct *task = current;
 	int depth = task->lock_depth + 1;
+	unsigned long flags;
 
-	if (likely(!depth))
+	if (likely(!depth)) {
 		/*
 		 * No recursion worries - we set up lock_depth _after_
 		 */
 		down(&kernel_sem);
 
+		spin_lock_irqsave(&global_lock, flags);
+		global_count++;
+		spin_unlock_irqrestore(&global_lock, flags);
+	}
+
 	task->lock_depth = depth;
 }
 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 18:17                             ` Linus Torvalds
@ 2008-05-07 18:49                               ` Ingo Molnar
  0 siblings, 0 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07 18:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Andi Kleen, Zhang, Yanmin, LKML, Alexander Viro,
	Andrew Morton


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> .. Hmm ... Time passes. Linus looks at git history.
> 
> It does look like "cond_resched()" has not worked with the BKL since 
> 2005, and hasn't taken the BKL into account. Commit 5bbcfd9000:
> 
>     [PATCH] cond_resched(): fix bogus might_sleep() warning
> 
> +       if (unlikely(preempt_count()))
> +               return;
> 
> which talks about the BKS, ie it only took the *semaphore* 
> implementation into account. Never the spinlock-with-preemption-count 
> one.
> 
> Or am I blind?

hm, i think you are right.

most of the latency-reduction work was concentrated on the
PREEMPT+PREEMPT_BKL case, and not getting proper cond_resched() behavior
in the !PREEMPT_BKL case would certainly not be noticed by distros or
users.

We made CONFIG_PREEMPT_BKL=y the default on SMP in v2.6.8, in this 
post-2.6.7 commit that introduced the feature:

|  commit fb8f6499abc6a847109d9602b797aa6afd2d5a3d
|  Author: Ingo Molnar <mingo@elte.hu>
|  Date:   Fri Jan 7 21:59:57 2005 -0800
|
|     [PATCH] remove the BKL by turning it into a semaphore

There was constant trouble around all these variations of preemptability 
and their combination with debugging helpers. (So i was rather happy to 
get rid of !PREEMPT_BKL - in the (apparently wrong) assumption that no 
tears would be shed.)

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 18:43                           ` Ingo Molnar
@ 2008-05-07 19:01                             ` Linus Torvalds
  2008-05-07 19:09                               ` Ingo Molnar
  2008-05-07 19:24                               ` Matthew Wilcox
  0 siblings, 2 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 19:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel



On Wed, 7 May 2008, Ingo Molnar wrote:
> 
> ok, the one below does irq ops and the counter behavior

No it doesn't. The counter isn't used for any actual *testing*, so the 
locking around it and the serialization of it have absolutely no impact on 
the scheduling behaviour!

Since the big slowdown was clearly accompanied by sleeping behaviour (the 
processes who didn't get the lock end up sleeping!), that is a *big* part 
of the slowdown.

Is it possible that your patch gets similar behaviour? Absolutely. But 
you're missing the whole point here. Anybody can make code behave badly 
and perform worse. But if you want to just verify that it's about the 
sleeping behaviour and timings of the BKL, then you need to do exactly 
that: emulate the sleeping behavior, not just the timings _outside_ of the 
sleeping behavior.

The thing is, we definitely are interested to see whether it's the BKL or 
some other semaphore that is the problem. But the best way to test that is 
to just try my patch that *guarantees* that the BKL doesn't have any 
semaphore behaviour AT ALL.

Could it be something else entirely? Yes. We know it's semaphore-related. 
We don't know for a fact that it's the BKL itself. There could be other 
semaphores that are that hot. It sounds unlikely, but quite frankly, 
regardless, I don't really see the point of your patches.

If Yanmin tries my patch, it is *guaranteed* to show something. It either 
shows that it's about the BKL (and that we absolutely have to do the BKL 
as something _else_ than a generic semaphore), or it shows that it's not 
about the BKL (and that _all_ the patches in this discussion are likely 
pointless).

In contrast, these "try to emulate bad behavior with the old known-ok 
semaphores" don't show anything AT ALL. We already know it's related to 
semaphores. And your patches aren't even guaranteed to show the same 
issue.

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 19:01                             ` Linus Torvalds
@ 2008-05-07 19:09                               ` Ingo Molnar
  2008-05-07 19:24                               ` Matthew Wilcox
  1 sibling, 0 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07 19:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin,
	LKML, Alexander Viro, linux-fsdevel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> In contrast, these "try to emulate bad behavior with the old known-ok 
> semaphores" don't show anything AT ALL. We already know it's related 
> to semaphores. And your patches aren't even guaranteed to show the 
> same issue.

yeah, i was just trying to come up with patches to probe which one of 
the following two possibilities is actually the case:

 - if the regression is due to the difference in scheduling behavior of 
   new semaphores (different wakeup patterns, etc.), that's fixable in 
   the new semaphore code => then the BKL code need not change.

 - if the regression is due to the difference in the fastpath cost, then 
   the new semaphores can probably not be improved (much of their appeal 
   comes from them not being complex and not being in assembly) => then 
   the BKL code needs to change to become cheaper [i.e. then we want 
   your patch].

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 19:01                             ` Linus Torvalds
  2008-05-07 19:09                               ` Ingo Molnar
@ 2008-05-07 19:24                               ` Matthew Wilcox
  2008-05-07 19:44                                 ` Linus Torvalds
  1 sibling, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-07 19:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, linux-fsdevel

On Wed, May 07, 2008 at 12:01:28PM -0700, Linus Torvalds wrote:
> The thing is, we definitely are interested to see whether it's the BKL or 
> some other semaphore that is the problem. But the best way to test that is 
> to just try my patch that *guarantees* that the BKL doesn't have any 
> semaphore behaviour AT ALL.
> 
> Could it be something else entirely? Yes. We know it's semaphore- related. 
> We don't know for a fact that it's the BKL itself. There could be other 
> semaphores that are that hot. It sounds unlikely, but quite frankly, 
> regardless, I don't really see the point of your patches.
> 
> If Yanmin tries my patch, it is *guaranteed* to show something. It either 
> shows that it's about the BKL (and that we absolutely have to do the BKL 
> as something _else_ than a generic semaphore), or it shows that it's not 
> about the BKL (and that _all_ the patches in this discussion are likely 
> pointless).

One patch I'd still like Yanmin to test is my one from yesterday which
removes the BKL from fs/locks.c.

http://marc.info/?l=linux-fsdevel&m=121009123427437&w=2

Obviously, it won't help if the problem isn't the BKL.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 19:24                               ` Matthew Wilcox
@ 2008-05-07 19:44                                 ` Linus Torvalds
  2008-05-07 20:00                                   ` Oi. NFS people. Read this Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-07 19:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, linux-fsdevel



On Wed, 7 May 2008, Matthew Wilcox wrote:
> 
> One patch I'd still like Yanmin to test is my one from yesterday which
> removes the BKL from fs/locks.c.

And I'd personally rather have the network-fs people test and comment on 
that one ;)

I think that patch is worth looking at regardless, but the problems with 
that one aren't about performance, but about what the implications are for 
the filesystems (if any)...

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Oi.  NFS people.  Read this.
  2008-05-07 19:44                                 ` Linus Torvalds
@ 2008-05-07 20:00                                   ` Matthew Wilcox
  2008-05-07 22:10                                     ` Trond Myklebust
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-07 20:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML,
	Alexander Viro, linux-fsdevel

On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote:
> On Wed, 7 May 2008, Matthew Wilcox wrote:
> > 
> > One patch I'd still like Yanmin to test is my one from yesterday which
> > removes the BKL from fs/locks.c.
> 
> And I'd personally rather have the network-fs people test and comment on 
> that one ;)
> 
> I think that patch is worth looking at regardless, but the problems with 
> that one aren't about performance, but about what the implications are for 
> the filesystems (if any)...

Oh, well, they don't seem interested.

I can comment on some of the problems though.

fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c,
fs/nfsd/nfs4state.c all walk the i_flock list under the BKL.  That won't
protect them against locks.c any more.  That's probably OK for fs/nfs/*
since they'll be protected by their own data structures (Someone please
check me on that?), but it's a bad idea for lockd/nfsd which are walking
the lists for filesystems.
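
[ For reference, the walk in question is essentially just a BKL-protected
  traversal of the per-inode lock list - a sketch, with made-up helpers
  standing in for the real lockd/nfsd logic: ]

	struct file_lock *fl;

	lock_kernel();
	for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
		if (lock_is_ours(fl))		/* hypothetical predicate */
			process_lock(fl);	/* hypothetical */
	}
	unlock_kernel();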

Are we going to have to export the file_lock_lock?  I'd rather not.  But
we need to keep nfsd/lockd from tripping over locks.c.

Maybe we could come up with a decent API that lockd could use?  It all
seems a bit complex at the moment ... maybe lockd should be keeping
track of the locks it owns anyway (since surely the posix deadlock
detection code can't work properly if it's just passing all the locks
through).

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: Oi.  NFS people.  Read this.
  2008-05-07 20:00                                   ` Oi. NFS people. Read this Matthew Wilcox
@ 2008-05-07 22:10                                     ` Trond Myklebust
  2008-05-09  1:43                                       ` J. Bruce Fields
  0 siblings, 1 reply; 140+ messages in thread
From: Trond Myklebust @ 2008-05-07 22:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, J. Bruce Fields,
	Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel

On Wed, 2008-05-07 at 14:00 -0600, Matthew Wilcox wrote:
> On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote:
> > On Wed, 7 May 2008, Matthew Wilcox wrote:
> > > 
> > > One patch I'd still like Yanmin to test is my one from yesterday which
> > > removes the BKL from fs/locks.c.
> > 
> > And I'd personally rather have the network-fs people test and comment on 
> > that one ;)
> > 
> > I think that patch is worth looking at regardless, but the problems with 
> > that one aren't about performance, but about what the implications are for 
> > the filesystems (if any)...
> 
> Oh, well, they don't seem interested.

Poor timing: we're all preparing for and travelling to the annual
Connectathon interoperability testing conference which starts tomorrow.

> I can comment on some of the problems though.
> 
> fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c,
> fs/nfsd/nfs4state.c all walk the i_flock list under the BKL.  That won't
> protect them against locks.c any more.  That's probably OK for fs/nfs/*
> since they'll be protected by their own data structures (Someone please
> check me on that?), but it's a bad idea for lockd/nfsd which are walking
> the lists for filesystems.

Yes. fs/nfs is just reusing the code in fs/locks.c in order to track the
locks it holds on the server. We could alternatively have coded a
private lock implementation, but this seemed easier.

> Are we going to have to export the file_lock_lock?  I'd rather not.  But
> we need to keep nfsd/lockd from tripping over locks.c.
> 
> Maybe we could come up with a decent API that lockd could use?  It all
> seems a bit complex at the moment ... maybe lockd should be keeping
> track of the locks it owns anyway (since surely the posix deadlock
> detection code can't work properly if it's just passing all the locks
> through).

I'm not sure what you mean when you talk about lockd keeping track of
the locks it owns. It has to keep those locks on inode->i_flock in order
to make them visible to the host filesystem...

All lockd really needs is the ability to find a lock it owns, and then
obtain a copy. As for the nfs client, I suspect we can make do with
something similar...
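
[ Purely as a hypothetical sketch of what such a helper in fs/locks.c
  could look like - the function name is invented, and file_lock_lock
  is the new spinlock from Matthew's patch rather than anything in the
  current tree: ]

	int vfs_find_lock_copy(struct inode *inode, fl_owner_t owner,
			       struct file_lock *copy)
	{
		struct file_lock *fl;
		int ret = -ENOENT;

		spin_lock(&file_lock_lock);
		for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
			if (fl->fl_owner == owner) {
				locks_copy_lock(copy, fl);
				ret = 0;
				break;
			}
		}
		spin_unlock(&file_lock_lock);
		return ret;
	}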

Cheers
  Trond


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 15:19               ` Linus Torvalds
  2008-05-07 17:14                 ` Ingo Molnar
@ 2008-05-08  2:44                 ` Zhang, Yanmin
  2008-05-08  3:29                   ` Linus Torvalds
  2008-05-08  6:43                   ` AIM7 40% regression with 2.6.26-rc1 Ingo Molnar
  1 sibling, 2 replies; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-08  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Matthew Wilcox, Ingo Molnar, LKML, Alexander Viro,
	Andrew Morton


On Wed, 2008-05-07 at 08:19 -0700, Linus Torvalds wrote:
> 
> On Wed, 7 May 2008, Linus Torvalds wrote:
> > 
> > But my preferred option would indeed be just turning it back into a 
> > spinlock - and screw latency and BKL preemption - and having the RT people 
> > who care deeply just work on removing the BKL in the long run.
> 
> Here's a trial balloon patch to do that.
> 
> Yanmin - this is not well tested, but the code is fairly obvious, and it 
> would be interesting to hear if this fixes the performance regression. 
> Because if it doesn't, then it's not the BKL, or something totally 
> different is going on.
Congratulations! The patch really fixes the regression completely!
vmstat showed cpu idle is 0%, just like 2.6.25's.

Some config options in my .config file:

CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y

yanmin

> 
> Now, we should probably also test just converting the thing to a mutex, 
> to see if that perhaps also fixes it.
> 
> 			Linus
> 
> ---
>  arch/mn10300/Kconfig    |   11 ----
>  include/linux/hardirq.h |   18 ++++---
>  kernel/sched.c          |   27 ++---------
>  lib/kernel_lock.c       |  120 +++++++++++++++++++++++++++++++---------------
>  4 files changed, 95 insertions(+), 81 deletions(-)
> 
> diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig
> index 6a6409a..e856218 100644
> --- a/arch/mn10300/Kconfig
> +++ b/arch/mn10300/Kconfig
> @@ -186,17 +186,6 @@ config PREEMPT
>  	  Say Y here if you are building a kernel for a desktop, embedded
>  	  or real-time system.  Say N if you are unsure.
>  
> -config PREEMPT_BKL
> -	bool "Preempt The Big Kernel Lock"
> -	depends on PREEMPT
> -	default y
> -	help
> -	  This option reduces the latency of the kernel by making the
> -	  big kernel lock preemptible.
> -
> -	  Say Y here if you are building a kernel for a desktop system.
> -	  Say N if you are unsure.
> -
>  config MN10300_CURRENT_IN_E2
>  	bool "Hold current task address in E2 register"
>  	default y
> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> index 897f723..181006c 100644
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -72,6 +72,14 @@
>  #define in_softirq()		(softirq_count())
>  #define in_interrupt()		(irq_count())
>  
> +#if defined(CONFIG_PREEMPT)
> +# define PREEMPT_INATOMIC_BASE kernel_locked()
> +# define PREEMPT_CHECK_OFFSET 1
> +#else
> +# define PREEMPT_INATOMIC_BASE 0
> +# define PREEMPT_CHECK_OFFSET 0
> +#endif
> +
>  /*
>   * Are we running in atomic context?  WARNING: this macro cannot
>   * always detect atomic context; in particular, it cannot know about
> @@ -79,17 +87,11 @@
>   * used in the general case to determine whether sleeping is possible.
>   * Do not use in_atomic() in driver code.
>   */
> -#define in_atomic()		((preempt_count() & ~PREEMPT_ACTIVE) != 0)
> -
> -#ifdef CONFIG_PREEMPT
> -# define PREEMPT_CHECK_OFFSET 1
> -#else
> -# define PREEMPT_CHECK_OFFSET 0
> -#endif
> +#define in_atomic()	((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_INATOMIC_BASE)
>  
>  /*
>   * Check whether we were atomic before we did preempt_disable():
> - * (used by the scheduler)
> + * (used by the scheduler, *after* releasing the kernel lock)
>   */
>  #define in_atomic_preempt_off() \
>  		((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 58fb8af..c51b656 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4567,8 +4567,6 @@ EXPORT_SYMBOL(schedule);
>  asmlinkage void __sched preempt_schedule(void)
>  {
>  	struct thread_info *ti = current_thread_info();
> -	struct task_struct *task = current;
> -	int saved_lock_depth;
>  
>  	/*
>  	 * If there is a non-zero preempt_count or interrupts are disabled,
> @@ -4579,16 +4577,7 @@ asmlinkage void __sched preempt_schedule(void)
>  
>  	do {
>  		add_preempt_count(PREEMPT_ACTIVE);
> -
> -		/*
> -		 * We keep the big kernel semaphore locked, but we
> -		 * clear ->lock_depth so that schedule() doesnt
> -		 * auto-release the semaphore:
> -		 */
> -		saved_lock_depth = task->lock_depth;
> -		task->lock_depth = -1;
>  		schedule();
> -		task->lock_depth = saved_lock_depth;
>  		sub_preempt_count(PREEMPT_ACTIVE);
>  
>  		/*
> @@ -4609,26 +4598,15 @@ EXPORT_SYMBOL(preempt_schedule);
>  asmlinkage void __sched preempt_schedule_irq(void)
>  {
>  	struct thread_info *ti = current_thread_info();
> -	struct task_struct *task = current;
> -	int saved_lock_depth;
>  
>  	/* Catch callers which need to be fixed */
>  	BUG_ON(ti->preempt_count || !irqs_disabled());
>  
>  	do {
>  		add_preempt_count(PREEMPT_ACTIVE);
> -
> -		/*
> -		 * We keep the big kernel semaphore locked, but we
> -		 * clear ->lock_depth so that schedule() doesnt
> -		 * auto-release the semaphore:
> -		 */
> -		saved_lock_depth = task->lock_depth;
> -		task->lock_depth = -1;
>  		local_irq_enable();
>  		schedule();
>  		local_irq_disable();
> -		task->lock_depth = saved_lock_depth;
>  		sub_preempt_count(PREEMPT_ACTIVE);
>  
>  		/*
> @@ -5853,8 +5831,11 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
>  	spin_unlock_irqrestore(&rq->lock, flags);
>  
>  	/* Set the preempt count _outside_ the spinlocks! */
> +#if defined(CONFIG_PREEMPT)
> +	task_thread_info(idle)->preempt_count = (idle->lock_depth >= 0);
> +#else
>  	task_thread_info(idle)->preempt_count = 0;
> -
> +#endif
>  	/*
>  	 * The idle tasks have their own, simple scheduling class:
>  	 */
> diff --git a/lib/kernel_lock.c b/lib/kernel_lock.c
> index cd3e825..06722aa 100644
> --- a/lib/kernel_lock.c
> +++ b/lib/kernel_lock.c
> @@ -11,79 +11,121 @@
>  #include <linux/semaphore.h>
>  
>  /*
> - * The 'big kernel semaphore'
> + * The 'big kernel lock'
>   *
> - * This mutex is taken and released recursively by lock_kernel()
> + * This spinlock is taken and released recursively by lock_kernel()
>   * and unlock_kernel().  It is transparently dropped and reacquired
>   * over schedule().  It is used to protect legacy code that hasn't
>   * been migrated to a proper locking design yet.
>   *
> - * Note: code locked by this semaphore will only be serialized against
> - * other code using the same locking facility. The code guarantees that
> - * the task remains on the same CPU.
> - *
>   * Don't use in new code.
>   */
> -static DECLARE_MUTEX(kernel_sem);
> +static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(kernel_flag);
> +
>  
>  /*
> - * Re-acquire the kernel semaphore.
> + * Acquire/release the underlying lock from the scheduler.
>   *
> - * This function is called with preemption off.
> + * This is called with preemption disabled, and should
> + * return an error value if it cannot get the lock and
> + * TIF_NEED_RESCHED gets set.
>   *
> - * We are executing in schedule() so the code must be extremely careful
> - * about recursion, both due to the down() and due to the enabling of
> - * preemption. schedule() will re-check the preemption flag after
> - * reacquiring the semaphore.
> + * If it successfully gets the lock, it should increment
> + * the preemption count like any spinlock does.
> + *
> + * (This works on UP too - _raw_spin_trylock will never
> + * return false in that case)
>   */
>  int __lockfunc __reacquire_kernel_lock(void)
>  {
> -	struct task_struct *task = current;
> -	int saved_lock_depth = task->lock_depth;
> -
> -	BUG_ON(saved_lock_depth < 0);
> -
> -	task->lock_depth = -1;
> -	preempt_enable_no_resched();
> -
> -	down(&kernel_sem);
> -
> +	while (!_raw_spin_trylock(&kernel_flag)) {
> +		if (test_thread_flag(TIF_NEED_RESCHED))
> +			return -EAGAIN;
> +		cpu_relax();
> +	}
>  	preempt_disable();
> -	task->lock_depth = saved_lock_depth;
> -
>  	return 0;
>  }
>  
>  void __lockfunc __release_kernel_lock(void)
>  {
> -	up(&kernel_sem);
> +	_raw_spin_unlock(&kernel_flag);
> +	preempt_enable_no_resched();
>  }
>  
>  /*
> - * Getting the big kernel semaphore.
> + * These are the BKL spinlocks - we try to be polite about preemption. 
> + * If SMP is not on (ie UP preemption), this all goes away because the
> + * _raw_spin_trylock() will always succeed.
>   */
> -void __lockfunc lock_kernel(void)
> +#ifdef CONFIG_PREEMPT
> +static inline void __lock_kernel(void)
>  {
> -	struct task_struct *task = current;
> -	int depth = task->lock_depth + 1;
> +	preempt_disable();
> +	if (unlikely(!_raw_spin_trylock(&kernel_flag))) {
> +		/*
> +		 * If preemption was disabled even before this
> +		 * was called, there's nothing we can be polite
> +		 * about - just spin.
> +		 */
> +		if (preempt_count() > 1) {
> +			_raw_spin_lock(&kernel_flag);
> +			return;
> +		}
>  
> -	if (likely(!depth))
>  		/*
> -		 * No recursion worries - we set up lock_depth _after_
> +		 * Otherwise, let's wait for the kernel lock
> +		 * with preemption enabled..
>  		 */
> -		down(&kernel_sem);
> +		do {
> +			preempt_enable();
> +			while (spin_is_locked(&kernel_flag))
> +				cpu_relax();
> +			preempt_disable();
> +		} while (!_raw_spin_trylock(&kernel_flag));
> +	}
> +}
>  
> -	task->lock_depth = depth;
> +#else
> +
> +/*
> + * Non-preemption case - just get the spinlock
> + */
> +static inline void __lock_kernel(void)
> +{
> +	_raw_spin_lock(&kernel_flag);
>  }
> +#endif
>  
> -void __lockfunc unlock_kernel(void)
> +static inline void __unlock_kernel(void)
>  {
> -	struct task_struct *task = current;
> +	/*
> +	 * the BKL is not covered by lockdep, so we open-code the
> +	 * unlocking sequence (and thus avoid the dep-chain ops):
> +	 */
> +	_raw_spin_unlock(&kernel_flag);
> +	preempt_enable();
> +}
>  
> -	BUG_ON(task->lock_depth < 0);
> +/*
> + * Getting the big kernel lock.
> + *
> + * This cannot happen asynchronously, so we only need to
> + * worry about other CPU's.
> + */
> +void __lockfunc lock_kernel(void)
> +{
> +	int depth = current->lock_depth+1;
> +	if (likely(!depth))
> +		__lock_kernel();
> +	current->lock_depth = depth;
> +}
>  
> -	if (likely(--task->lock_depth < 0))
> -		up(&kernel_sem);
> +void __lockfunc unlock_kernel(void)
> +{
> +	BUG_ON(current->lock_depth < 0);
> +	if (likely(--current->lock_depth < 0))
> +		__unlock_kernel();
>  }
>  
>  EXPORT_SYMBOL(lock_kernel);


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 16:23     ` Matthew Wilcox
  2008-05-06 16:36       ` Linus Torvalds
  2008-05-06 17:21       ` Andrew Morton
@ 2008-05-08  3:24       ` Zhang, Yanmin
  2008-05-08  3:34         ` Linus Torvalds
  2 siblings, 1 reply; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-08  3:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ingo Molnar, J. Bruce Fields, LKML, Alexander Viro,
	Linus Torvalds, Andrew Morton, linux-fsdevel


On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote:
> On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > So the only likely things I can see are:
> > 
> >  - file locks
> >  - fasync
> 
> I've wanted to fix file locks for a while.  Here's a first attempt.
> It was done quickly, so I concede that it may well have bugs in it.
> I found (and fixed) one with LTP.
> 
> It takes *no account* of nfsd, nor remote filesystems.  We need to have
> a serious discussion about their requirements.
I tested it on the 8-core stoakley. The aim7 result is 23% worse than that of
pure 2.6.26-rc1.

I'm replying to this email directly because you have many patches and I might
otherwise test one you didn't expect.

-yanmin

> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 663c069..cb09765 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -140,6 +140,8 @@ int lease_break_time = 45;
>  #define for_each_lock(inode, lockp) \
>  	for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
>  
> +static DEFINE_SPINLOCK(file_lock_lock);
> +
>  static LIST_HEAD(file_lock_list);
>  static LIST_HEAD(blocked_list);
>  
> @@ -510,9 +512,9 @@ static void __locks_delete_block(struct file_lock *waiter)
>   */
>  static void locks_delete_block(struct file_lock *waiter)
>  {
> -	lock_kernel();
> +	spin_lock(&file_lock_lock);
>  	__locks_delete_block(waiter);
> -	unlock_kernel();
> +	spin_unlock(&file_lock_lock);
>  }
>  
>  /* Insert waiter into blocker's block list.
> @@ -649,7 +651,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
>  {
>  	struct file_lock *cfl;
>  
> -	lock_kernel();
> +	spin_lock(&file_lock_lock);
>  	for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) {
>  		if (!IS_POSIX(cfl))
>  			continue;
> @@ -662,7 +664,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
>  			fl->fl_pid = pid_vnr(cfl->fl_nspid);
>  	} else
>  		fl->fl_type = F_UNLCK;
> -	unlock_kernel();
> +	spin_unlock(&file_lock_lock);
>  	return;
>  }
>  EXPORT_SYMBOL(posix_test_lock);
> @@ -735,18 +737,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
>  	int error = 0;
>  	int found = 0;
>  
> -	lock_kernel();
> -	if (request->fl_flags & FL_ACCESS)
> +	if (request->fl_flags & FL_ACCESS) {
> +		spin_lock(&file_lock_lock);
>  		goto find_conflict;
> +	}
>  
>  	if (request->fl_type != F_UNLCK) {
>  		error = -ENOMEM;
> +		
>  		new_fl = locks_alloc_lock();
>  		if (new_fl == NULL)
> -			goto out;
> +			goto out_unlocked;
>  		error = 0;
>  	}
>  
> +	spin_lock(&file_lock_lock);
>  	for_each_lock(inode, before) {
>  		struct file_lock *fl = *before;
>  		if (IS_POSIX(fl))
> @@ -772,10 +777,13 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
>  	 * If a higher-priority process was blocked on the old file lock,
>  	 * give it the opportunity to lock the file.
>  	 */
> -	if (found)
> +	if (found) {
> +		spin_unlock(&file_lock_lock);
>  		cond_resched();
> +		spin_lock(&file_lock_lock);
> +	}
>  
> -find_conflict:
> + find_conflict:
>  	for_each_lock(inode, before) {
>  		struct file_lock *fl = *before;
>  		if (IS_POSIX(fl))
> @@ -796,8 +804,9 @@ find_conflict:
>  	new_fl = NULL;
>  	error = 0;
>  
> -out:
> -	unlock_kernel();
> + out:
> +	spin_unlock(&file_lock_lock);
> + out_unlocked:
>  	if (new_fl)
>  		locks_free_lock(new_fl);
>  	return error;
> @@ -826,7 +835,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
>  		new_fl2 = locks_alloc_lock();
>  	}
>  
> -	lock_kernel();
> +	spin_lock(&file_lock_lock);
>  	if (request->fl_type != F_UNLCK) {
>  		for_each_lock(inode, before) {
>  			fl = *before;
> @@ -994,7 +1003,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
>  		locks_wake_up_blocks(left);
>  	}
>   out:
> -	unlock_kernel();
> +	spin_unlock(&file_lock_lock);
>  	/*
>  	 * Free any unused locks.
>  	 */
> @@ -1069,14 +1078,14 @@ int locks_mandatory_locked(struct inode *inode)
>  	/*
>  	 * Search the lock list for this inode for any POSIX locks.
>  	 */
> -	lock_kernel();
> +	spin_lock(&file_lock_lock);
>  	for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
>  		if (!IS_POSIX(fl))
>  			continue;
>  		if (fl->fl_owner != owner)
>  			break;
>  	}
> -	unlock_kernel();
> +	spin_unlock(&file_lock_lock);
>  	return fl ? -EAGAIN : 0;
>  }
>  
> @@ -1190,7 +1199,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
>  
>  	new_fl = lease_alloc(NULL, mode & FMODE_WRITE ? F_WRLCK : F_RDLCK);
>  
> -	lock_kernel();
> +	spin_lock(&file_lock_lock);
>  
>  	time_out_leases(inode);
>  
> @@ -1251,8 +1260,10 @@ restart:
>  			break_time++;
>  	}
>  	locks_insert_block(flock, new_fl);
> +	spin_unlock(&file_lock_lock);
>  	error = wait_event_interruptible_timeout(new_fl->fl_wait,
>  						!new_fl->fl_next, break_time);
> +	spin_lock(&file_lock_lock);
>  	__locks_delete_block(new_fl);
>  	if (error >= 0) {
>  		if (error == 0)
> @@ -1266,8 +1277,8 @@ restart:
>  		error = 0;
>  	}
>  
> -out:
> -	unlock_kernel();
> + out:
> +	spin_unlock(&file_lock_lock);
>  	if (!IS_ERR(new_fl))
>  		locks_free_lock(new_fl);
>  	return error;
> @@ -1323,7 +1334,7 @@ int fcntl_getlease(struct file *filp)
>  	struct file_lock *fl;
>  	int type = F_UNLCK;
>  
> -	lock_kernel();
> +	spin_lock(&file_lock_lock);
>  	time_out_leases(filp->f_path.dentry->d_inode);
>  	for (fl = filp->f_path.dentry->d_inode->i_flock; fl && IS_LEASE(fl);
>  			fl = fl->fl_next) {
> @@ -1332,7 +1343,7 @@ int fcntl_getlease(struct file *filp)
>  			break;
>  		}
>  	}
> -	unlock_kernel();
> +	spin_unlock(&file_lock_lock);
>  	return type;
>  }
>  
> @@ -1363,6 +1374,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
>  	if (error)
>  		return error;
>  
> +	spin_lock(&file_lock_lock);
>  	time_out_leases(inode);
>  
>  	BUG_ON(!(*flp)->fl_lmops->fl_break);
> @@ -1370,10 +1382,11 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
>  	lease = *flp;
>  
>  	if (arg != F_UNLCK) {
> +		spin_unlock(&file_lock_lock);
>  		error = -ENOMEM;
>  		new_fl = locks_alloc_lock();
>  		if (new_fl == NULL)
> -			goto out;
> +			goto out_unlocked;
>  
>  		error = -EAGAIN;
>  		if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
> @@ -1382,6 +1395,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
>  		    && ((atomic_read(&dentry->d_count) > 1)
>  			|| (atomic_read(&inode->i_count) > 1)))
>  			goto out;
> +		spin_lock(&file_lock_lock);
>  	}
>  
>  	/*
> @@ -1429,11 +1443,14 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
>  
>  	locks_copy_lock(new_fl, lease);
>  	locks_insert_lock(before, new_fl);
> +	spin_unlock(&file_lock_lock);
>  
>  	*flp = new_fl;
>  	return 0;
>  
> -out:
> + out:
> +	spin_unlock(&file_lock_lock);
> + out_unlocked:
>  	if (new_fl != NULL)
>  		locks_free_lock(new_fl);
>  	return error;
> @@ -1471,12 +1488,10 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
>  {
>  	int error;
>  
> -	lock_kernel();
>  	if (filp->f_op && filp->f_op->setlease)
>  		error = filp->f_op->setlease(filp, arg, lease);
>  	else
>  		error = generic_setlease(filp, arg, lease);
> -	unlock_kernel();
>  
>  	return error;
>  }
> @@ -1503,12 +1518,11 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
>  	if (error)
>  		return error;
>  
> -	lock_kernel();
> -
>  	error = vfs_setlease(filp, arg, &flp);
>  	if (error || arg == F_UNLCK)
> -		goto out_unlock;
> +		return error;
>  
> +	lock_kernel();
>  	error = fasync_helper(fd, filp, 1, &flp->fl_fasync);
>  	if (error < 0) {
>  		/* remove lease just inserted by setlease */
> @@ -1519,7 +1533,7 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
>  	}
>  
>  	error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
> -out_unlock:
> + out_unlock:
>  	unlock_kernel();
>  	return error;
>  }
> @@ -2024,7 +2038,7 @@ void locks_remove_flock(struct file *filp)
>  			fl.fl_ops->fl_release_private(&fl);
>  	}
>  
> -	lock_kernel();
> +	spin_lock(&file_lock_lock);
>  	before = &inode->i_flock;
>  
>  	while ((fl = *before) != NULL) {
> @@ -2042,7 +2056,7 @@ void locks_remove_flock(struct file *filp)
>   		}
>  		before = &fl->fl_next;
>  	}
> -	unlock_kernel();
> +	spin_unlock(&file_lock_lock);
>  }
>  
>  /**
> @@ -2057,12 +2071,12 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter)
>  {
>  	int status = 0;
>  
> -	lock_kernel();
> +	spin_lock(&file_lock_lock);
>  	if (waiter->fl_next)
>  		__locks_delete_block(waiter);
>  	else
>  		status = -ENOENT;
> -	unlock_kernel();
> +	spin_unlock(&file_lock_lock);
>  	return status;
>  }
>  
> @@ -2175,7 +2189,7 @@ static int locks_show(struct seq_file *f, void *v)
>  
>  static void *locks_start(struct seq_file *f, loff_t *pos)
>  {
> -	lock_kernel();
> +	spin_lock(&file_lock_lock);
>  	f->private = (void *)1;
>  	return seq_list_start(&file_lock_list, *pos);
>  }
> @@ -2187,7 +2201,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
>  
>  static void locks_stop(struct seq_file *f, void *v)
>  {
> -	unlock_kernel();
> +	spin_unlock(&file_lock_lock);
>  }
>  
>  struct seq_operations locks_seq_operations = {
> @@ -2215,7 +2229,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
>  {
>  	struct file_lock *fl;
>  	int result = 1;
> -	lock_kernel();
> +	spin_lock(&file_lock_lock);
>  	for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
>  		if (IS_POSIX(fl)) {
>  			if (fl->fl_type == F_RDLCK)
> @@ -2232,7 +2246,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
>  		result = 0;
>  		break;
>  	}
> -	unlock_kernel();
> +	spin_unlock(&file_lock_lock);
>  	return result;
>  }
>  
> @@ -2255,7 +2269,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
>  {
>  	struct file_lock *fl;
>  	int result = 1;
> -	lock_kernel();
> +	spin_lock(&file_lock_lock);
>  	for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
>  		if (IS_POSIX(fl)) {
>  			if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
> @@ -2270,7 +2284,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
>  		result = 0;
>  		break;
>  	}
> -	unlock_kernel();
> +	spin_unlock(&file_lock_lock);
>  	return result;
>  }
>  
> 


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  2:44                 ` Zhang, Yanmin
@ 2008-05-08  3:29                   ` Linus Torvalds
  2008-05-08  4:08                     ` Zhang, Yanmin
  2008-05-08  6:43                   ` AIM7 40% regression with 2.6.26-rc1 Ingo Molnar
  1 sibling, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08  3:29 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andi Kleen, Matthew Wilcox, Ingo Molnar, LKML, Alexander Viro,
	Andrew Morton



On Thu, 8 May 2008, Zhang, Yanmin wrote:
>
> Congratulations! The patch really fixes the regression completely!
> vmstat showed cpu idle is 0%, just like 2.6.25's.

Well, that shows that it was the BKL.

That said, "idle 0%" is easy when you spin. Do you also have actual 
performance numbers? I'd hope that not only do we use full CPU time, it's 
also at least as fast as the old semaphores were?

While I've been dissing sleeping locks (because their overhead is so 
high), at least in _theory_ they can get better behavior when not 
spinning. Now, that's not going to happen with the BKL, I'm 99.99% sure, 
but I'd still like to hear actual performance numbers too, just to be 
sure.

Anyway, at least the "was it the BKL or some other semaphore user" 
question is entirely off the table.

So we need to

 - fix the BKL. My patch may be a good starting point, but there are 
   alternatives:

    (a) reinstate the old BKL code entirely

	Quite frankly, I'd prefer not to. Not only did it have three 
	totally different cases, some of them were apparently broken (ie 
	BKL+regular preempt didn't cond_resched() right), and I just don't 
	think it's worth maintaining three different versions, when 
	distro's are going to pick one anyway. *We* should pick one, and 
	maintain it.

    (b) screw the special BKL preemption - it's a spinlock, we don't 
	preempt other spinlocks, but at least fix BKL+preempt+cond_resched
	thing. 

	This would be "my patch + fixes" where at least one of the fixes 
	is the known (apparently old) cond_resched() bug.

    (c) Try to keep the 2.6.25 code as closely as possible, but just 
	switch over to mutexes instead.

	I dunno. I was never all that enamoured with the BKL as a sleeping 
	lock, so I'm biased against this one, but hey, it's just a 
	personal bias.

 - get rid of the BKL anyway, at least in anything that is noticeable.

	Matthew's patch to file locking is probably worth doing as-is, 
	simply because I haven't heard any better solutions. The BKL 
	certainly can't be it, and whatever comes out of the NFSD 
	discussion will almost certainly involve just making sure that 
	those leases just use the new fs/locks.c lock.

	This is also why I'd actually prefer the simplest possible 
	(non-preempting) spinlock BKL. Because it means that we can get 
	rid of all that "saved_lock_depth" crud (like my patch already 
	did). We shouldn't aim for a clever BKL, we should aim for a BKL 
	that nobody uses.

I'm certainly open to anything. Regardless, we should decide fairly soon, 
so that we have the choice made before -rc2 is out, and not drag this out, 
since regardless of the choice it needs to be tested and people comfy with 
it for the 2.6.26 release.

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  3:24       ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
@ 2008-05-08  3:34         ` Linus Torvalds
  2008-05-08  4:37           ` Zhang, Yanmin
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08  3:34 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML,
	Alexander Viro, Andrew Morton, linux-fsdevel



On Thu, 8 May 2008, Zhang, Yanmin wrote:
> 
> On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote:
> > On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > > So the only likely things I can see are:
> > > 
> > >  - file locks
> > >  - fasync
> > 
> > I've wanted to fix file locks for a while.  Here's a first attempt.
> > It was done quickly, so I concede that it may well have bugs in it.
> > I found (and fixed) one with LTP.
> > 
> > It takes *no account* of nfsd, nor remote filesystems.  We need to have
> > a serious discussion about their requirements.
>
> I tested it on 8-core stoakley. aim7 result becomes 23% worse than the one of
> pure 2.6.26-rc1.

Ouch. That's really odd. The BKL->spinlock conversion looks really 
obvious, so it shouldn't be that noticeably slower.

The *one* difference is that the BKL has the whole "you can take it 
recursively and you can sleep without dropping it because the scheduler 
will drop it for you" thing. The spinlock conversion changed all of that 
into explicit "drop and retake" locks, and maybe that causes some issues. 
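
A rough sketch of that semantic difference (illustration only, not code
from any patch in this thread; demo_lock, wq, cond and the two helpers
are made-up stand-ins):

	#include <linux/smp_lock.h>
	#include <linux/spinlock.h>
	#include <linux/wait.h>

	static DEFINE_SPINLOCK(demo_lock);
	static DECLARE_WAIT_QUEUE_HEAD(wq);
	static int cond;	/* stand-in for whatever condition we wait on */

	static void bkl_style(void)
	{
		lock_kernel();
		lock_kernel();		/* recursion just bumps current->lock_depth */
		wait_event(wq, cond);	/* sleeping is fine: schedule() drops the BKL */
		unlock_kernel();
		unlock_kernel();
	}

	static void spinlock_style(void)
	{
		spin_lock(&demo_lock);
		/* must not sleep while the spinlock is held ... */
		spin_unlock(&demo_lock);
		wait_event(wq, cond);	/* ... so drop it, sleep, */
		spin_lock(&demo_lock);	/* then retake it and revalidate state */
		spin_unlock(&demo_lock);
	}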

But 23% worse? That sounds really odd/extreme.

Can you do an oprofile run or something?

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  3:29                   ` Linus Torvalds
@ 2008-05-08  4:08                     ` Zhang, Yanmin
  2008-05-08  4:17                       ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-08  4:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Matthew Wilcox, Ingo Molnar, LKML, Alexander Viro,
	Andrew Morton


On Wed, 2008-05-07 at 20:29 -0700, Linus Torvalds wrote:
> 
> On Thu, 8 May 2008, Zhang, Yanmin wrote:
> >
> > Congratulations! The patch really fixes the regression completely!
> > vmstat showed cpu idle is 0%, just like 2.6.25's.
> 
> Well, that shows that it was the BKL.
> 
> That said, "idle 0%" is easy when you spin. Do you also have actual 
> performance numbers?
Yes. My conclusion is based on the actual numbers; 0% cpu idle is just
the behavior it should show.

>  I'd hope that not only do we use full CPU time, it's 
> also at least as fast as the old semaphores were?
Yes.

> 
> While I've been dissing sleeping locks (because their overhead is so 
> high), at least in _theory_ they can get better behavior when not 
> spinning. Now, that's not going to happen with the BKL, I'm 99.99% sure, 
> but I'd still like to hear actual performance numbers too, just to be 
> sure.
For sure.

> 
> Anyway, at least the "was it the BKL or some other semaphore user" 
> question is entirely off the table.
> 
> So we need to
> 
>  - fix the BKL. My patch may be a good starting point, but there are 
>    alternatives:
> 
>     (a) reinstate the old BKL code entirely
> 
> 	Quite frankly, I'd prefer not to. Not only did it have three 
> 	totally different cases, some of them were apparently broken (ie 
> 	BKL+regular preempt didn't cond_resched() right), and I just don't 
> 	think it's worth maintaining three different versions, when 
> 	distros are going to pick one anyway. *We* should pick one, and 
> 	maintain it.
> 
>     (b) screw the special BKL preemption - it's a spinlock, we don't 
> 	preempt other spinlocks, but at least fix BKL+preempt+cond_resched
> 	thing. 
> 
> 	This would be "my patch + fixes" where at least one of the fixes 
> 	is the known (apparently old) cond_preempt() bug.
> 
>     (c) Try to keep as close to the 2.6.25 code as possible, but just 
> 	switch over to mutexes instead.
> 
> 	I dunno. I was never all that enamoured with the BKL as a sleeping 
> 	lock, so I'm biased against this one, but hey, it's just a 
> 	personal bias.
> 
>  - get rid of the BKL anyway, at least in anything that is noticeable.
> 
> 	Matthew's patch to file locking is probably worth doing as-is, 
> 	simply because I haven't heard any better solutions. The BKL 
> 	certainly can't be it, and whatever comes out of the NFSD 
> 	discussion will almost certainly involve just making sure that 
> 	those leases just use the new fs/locks.c lock.
> 
> 	This is also why I'd actually prefer the simplest possible 
> 	(non-preempting) spinlock BKL. Because it means that we can get 
> 	rid of all that "saved_lock_depth" crud (like my patch already 
> 	did). We shouldn't aim for a clever BKL, we should aim for a BKL 
> 	that nobody uses.
> 
> I'm certainly open to anything. Regardless, we should decide fairly soon, 
> so that we have the choice made before -rc2 is out, and not drag this out, 
> since regardless of the choice it needs to be tested and people comfy with 
> it for the 2.6.26 release.
> 
> 			Linus


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  4:08                     ` Zhang, Yanmin
@ 2008-05-08  4:17                       ` Linus Torvalds
  2008-05-08 12:01                         ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08  4:17 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andi Kleen, Matthew Wilcox, Ingo Molnar, LKML, Alexander Viro,
	Andrew Morton



On Thu, 8 May 2008, Zhang, Yanmin wrote:
> 
> On Wed, 2008-05-07 at 20:29 -0700, Linus Torvalds wrote:
> > 
> > That said, "idle 0%" is easy when you spin. Do you also have actual 
> > performance numbers?
>
> Yes. My conclusion is based on the actual numbers; 0% cpu idle is just
> the behavior it should show.

Thanks, that's all I wanted to verify.

I'll leave this overnight, and see if somebody has come up with some smart 
and wonderful patch. And if not, I think I'll apply mine as "known to fix 
a regression", and we can perhaps then improve on things further from 
there.

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  3:34         ` Linus Torvalds
@ 2008-05-08  4:37           ` Zhang, Yanmin
  2008-05-08 14:58             ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-08  4:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML,
	Alexander Viro, Andrew Morton, linux-fsdevel


On Wed, 2008-05-07 at 20:34 -0700, Linus Torvalds wrote:
> 
> On Thu, 8 May 2008, Zhang, Yanmin wrote:
> > 
> > On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote:
> > > On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > > > So the only likely things I can see are:
> > > > 
> > > >  - file locks
> > > >  - fasync
> > > 
> > > I've wanted to fix file locks for a while.  Here's a first attempt.
> > > It was done quickly, so I concede that it may well have bugs in it.
> > > I found (and fixed) one with LTP.
> > > 
> > > It takes *no account* of nfsd, nor remote filesystems.  We need to have
> > > a serious discussion about their requirements.
> >
> > I tested it on 8-core stoakley. aim7 result becomes 23% worse than the one of
> > pure 2.6.26-rc1.
> 
> Ouch. That's really odd. The BKL->spinlock conversion looks really 
> obvious, so it shouldn't be that noticeably slower.
> 
> The *one* difference is that the BKL has the whole "you can take it 
> recursively and you can sleep without dropping it because the scheduler 
> will drop it for you" thing. The spinlock conversion changed all of that 
> into explicit "drop and retake" locks, and maybe that causes some issues. 
> 
> But 23% worse? That sounds really odd/extreme.
> 
> Can you do an oprofile run or something?
I collected oprofile data. It doesn't look very useful, as cpu idle is more than 50%.

samples  %        app name                 symbol name
270157    9.4450  multitask                add_long
266419    9.3143  multitask                add_int
238934    8.3534  multitask                add_double
187184    6.5442  multitask                mul_double
159448    5.5745  multitask                add_float
156312    5.4649  multitask                sieve
148081    5.1771  multitask                mul_float
127192    4.4468  multitask                add_short
80480     2.8137  multitask                string_rtns_1
57520     2.0110  vmlinux                  clear_page_c
53935     1.8856  multitask                div_long
48753     1.7045  libc-2.6.90.so           strncat
40825     1.4273  multitask                array_rtns
32807     1.1470  vmlinux                  __copy_user_nocache
31995     1.1186  multitask                div_int
31143     1.0888  multitask                div_float
28821     1.0076  multitask                div_double
26400     0.9230  vmlinux                  find_lock_page
26159     0.9146  vmlinux                  unmap_vmas
25249     0.8827  multitask                div_short
21509     0.7520  vmlinux                  native_read_tsc
18865     0.6595  vmlinux                  copy_user_generic_string
17993     0.6291  vmlinux                  copy_page_c
16367     0.5722  vmlinux                  system_call
14616     0.5110  libc-2.6.90.so           msort_with_tmp
13630     0.4765  vmlinux                  native_sched_clock
12952     0.4528  vmlinux                  copy_page_range
12817     0.4481  libc-2.6.90.so           strcat
12708     0.4443  vmlinux                  calc_delta_mine
12611     0.4409  libc-2.6.90.so           memset
11631     0.4066  bash                     (no symbols)
9991      0.3493  vmlinux                  update_curr
9328      0.3261  vmlinux                  unlock_page



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  2:44                 ` Zhang, Yanmin
  2008-05-08  3:29                   ` Linus Torvalds
@ 2008-05-08  6:43                   ` Ingo Molnar
  2008-05-08  6:48                     ` Andrew Morton
  2008-05-08  7:14                     ` Zhang, Yanmin
  1 sibling, 2 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08  6:43 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton


* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> > Here's a trial balloon patch to do that.
> > 
> > Yanmin - this is not well tested, but the code is fairly obvious, 
> > and it would be interesting to hear if this fixes the performance 
> > regression. Because if it doesn't, then it's not the BKL, or 
> > something totally different is going on.
>
> Congratulations! The patch really fixes the regression completely! 
> vmstat showed cpu idle is 0%, just like 2.6.25's.

great! Yanmin, could you please also check the other patch i sent (also 
attached below), does it solve the regression similarly?

	Ingo

---
 lib/kernel_lock.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux/lib/kernel_lock.c
===================================================================
--- linux.orig/lib/kernel_lock.c
+++ linux/lib/kernel_lock.c
@@ -46,7 +46,8 @@ int __lockfunc __reacquire_kernel_lock(v
 	task->lock_depth = -1;
 	preempt_enable_no_resched();
 
-	down(&kernel_sem);
+	while (down_trylock(&kernel_sem))
+		cpu_relax();
 
 	preempt_disable();
 	task->lock_depth = saved_lock_depth;
@@ -67,11 +68,13 @@ void __lockfunc lock_kernel(void)
 	struct task_struct *task = current;
 	int depth = task->lock_depth + 1;
 
-	if (likely(!depth))
+	if (likely(!depth)) {
 		/*
 		 * No recursion worries - we set up lock_depth _after_
 		 */
-		down(&kernel_sem);
+		while (down_trylock(&kernel_sem))
+			cpu_relax();
+	}
 
 	task->lock_depth = depth;
 }

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  6:43                   ` AIM7 40% regression with 2.6.26-rc1 Ingo Molnar
@ 2008-05-08  6:48                     ` Andrew Morton
  2008-05-08  7:14                     ` Zhang, Yanmin
  1 sibling, 0 replies; 140+ messages in thread
From: Andrew Morton @ 2008-05-08  6:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zhang, Yanmin, Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML,
	Alexander Viro

On Thu, 8 May 2008 08:43:40 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> great! Yanmin, could you please also check the other patch i sent (also 
> attached below), does it solve the regression similarly?
> 
> 	Ingo
> 
> ---
>  lib/kernel_lock.c |    9 ++++++---

but but but.  Some other users of down() have presumably also regressed.  We just
haven't found the workload to demonstrate that yet.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  6:43                   ` AIM7 40% regression with 2.6.26-rc1 Ingo Molnar
  2008-05-08  6:48                     ` Andrew Morton
@ 2008-05-08  7:14                     ` Zhang, Yanmin
  2008-05-08  7:39                       ` Ingo Molnar
  1 sibling, 1 reply; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-08  7:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton


On Thu, 2008-05-08 at 08:43 +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> 
> > > Here's a trial balloon patch to do that.
> > > 
> > > Yanmin - this is not well tested, but the code is fairly obvious, 
> > > and it would be interesting to hear if this fixes the performance 
> > > regression. Because if it doesn't, then it's not the BKL, or 
> > > something totally different is going on.
> >
> > Congratulations! The patch really fixes the regression completely! 
> > vmstat showed cpu idle is 0%, just like 2.6.25's.
> 
> great! Yanmin, could you please also check the other patch i sent (also 
> attached below), does it solve the regression similarly?
With your patch, the aim7 regression becomes less than 2%. I ran the test twice.

Linus' patch could recover it completely. As the aim7 result is quite stable (usually
fluctuating less than 1%), 1.5%~2% is still a bit large.

> 
> 	Ingo
> 
> ---
>  lib/kernel_lock.c |    9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> Index: linux/lib/kernel_lock.c
> ===================================================================
> --- linux.orig/lib/kernel_lock.c
> +++ linux/lib/kernel_lock.c
> @@ -46,7 +46,8 @@ int __lockfunc __reacquire_kernel_lock(v
>  	task->lock_depth = -1;
>  	preempt_enable_no_resched();
>  
> -	down(&kernel_sem);
> +	while (down_trylock(&kernel_sem))
> +		cpu_relax();
>  
>  	preempt_disable();
>  	task->lock_depth = saved_lock_depth;
> @@ -67,11 +68,13 @@ void __lockfunc lock_kernel(void)
>  	struct task_struct *task = current;
>  	int depth = task->lock_depth + 1;
>  
> -	if (likely(!depth))
> +	if (likely(!depth)) {
>  		/*
>  		 * No recursion worries - we set up lock_depth _after_
>  		 */
> -		down(&kernel_sem);
> +		while (down_trylock(&kernel_sem))
> +			cpu_relax();
> +	}
>  
>  	task->lock_depth = depth;
>  }


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  7:14                     ` Zhang, Yanmin
@ 2008-05-08  7:39                       ` Ingo Molnar
  2008-05-08  8:44                         ` Zhang, Yanmin
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08  7:39 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton


* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> > great! Yanmin, could you please also check the other patch i sent 
> > (also attached below), does it solve the regression similarly?
>
> With your patch, the aim7 regression becomes less than 2%. I ran the 
> test twice.
> 
> Linus' patch could recover it completely. As the aim7 result is quite 
> stable (usually fluctuating less than 1%), 1.5%~2% is still a bit large.

is this the old original aim7 you are running, or osdl-aim-7 or 
re-aim-7?

if it's aim7 then this is a workload that starts+stops 2000 parallel 
tasks that each start and exit at the same time. That might explain its 
sensitivity to the BKL - this is all about tty-controlled task startup 
and exit.

i could not get it to produce anywhere close to stable results though. I 
also frequently get into this problem:

  AIM Multiuser Benchmark - Suite VII Run Beginning
  Tasks    jobs/min  jti  jobs/min/task      real       cpu
   2000
  Failed to execute
          new_raph 200
  Unable to solve equation in 100 tries. P = 1.5708, P0 = 1.5708, delta = 6.12574e-17

  Failed to execute
          disk_cp /mnt/shm
  disk_cp (1): cannot open /mnt/shm/tmpa.common
  disk1.c: No such file or directory

  [.. etc. a large stream of them .. ]

system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its 
work files.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  7:39                       ` Ingo Molnar
@ 2008-05-08  8:44                         ` Zhang, Yanmin
  2008-05-08  9:21                           ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-08  8:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 2127 bytes --]


On Thu, 2008-05-08 at 09:39 +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> 
> > > great! Yanmin, could you please also check the other patch i sent 
> > > (also attached below), does it solve the regression similarly?
> >
> > With your patch, the aim7 regression becomes less than 2%. I ran the 
> > test twice.
> > 
> > Linus' patch could recover it completely. As the aim7 result is quite 
> > stable (usually fluctuating less than 1%), 1.5%~2% is still a bit large.
> 
> is this the old original aim7 you are running,
I use the old AIM7 plus a small patch which just changes a couple of data types to match
64-bit.

>  or osdl-aim-7 or 
> re-aim-7?
> 
> if it's aim7 then this is a workload that starts+stops 2000 parallel 
> tasks that each start and exit at the same time.
Yes.

>  That might explain its 
> sensitivity to the BKL - this is all about tty-controlled task startup 
> and exit.
> 
> i could not get it to produce anywhere close to stable results though. I 
> also frequently get into this problem:
> 
>   AIM Multiuser Benchmark - Suite VII Run Beginning
>   Tasks    jobs/min  jti  jobs/min/task      real       cpu
>    2000
>   Failed to execute
>           new_raph 200
>   Unable to solve equation in 100 tries. P = 1.5708, P0 = 1.5708, delta = 6.12574e-17
> 
>   Failed to execute
>           disk_cp /mnt/shm
>   disk_cp (1): cannot open /mnt/shm/tmpa.common
>   disk1.c: No such file or directory
> 
>   [.. etc. a large stream of them .. ]
> 
> system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its 
> work files.
My machine has 8GB. To simulate your environment, I reserved 6GB for hugetlb, then reran the test
and didn't see any failure except:
AIM Multiuser Benchmark - Suite VII Run Beginning

Tasks    jobs/min  jti  jobs/min/task      real       cpu
 2000create_shared_memory(): can't create semaphore, pausing...
create_shared_memory(): can't create semaphore, pausing...


The above messages don't indicate errors.

Perhaps you could:
1) Apply the attached aim9 patch;
2) Check that you have write permission under /mnt/shm;
3) echo "/mnt/shm" > aim7_path/config;


[-- Attachment #2: aim9.diff --]
[-- Type: text/x-patch, Size: 2481 bytes --]

diff -urp aim9.orig/creat-clo.c aim9/creat-clo.c
--- aim9.orig/creat-clo.c	2002-04-22 15:25:16.000000000 -0700
+++ aim9/creat-clo.c	2005-07-11 10:20:13.000000000 -0700
@@ -352,7 +352,7 @@ page_test(char *argv,
 	 */
 	oldbrk = sbrk(0);		/* get current break value */
 	newbrk = sbrk(1024 * 1024);	/* move up 1 megabyte */
-	if ((int)newbrk == -1) {
+	if (newbrk == (void *)-1L) {
 		perror("\npage_test");	/* tell more info */
 		fprintf(stderr, "page_test: Unable to do initial sbrk.\n");
 		return (-1);
@@ -365,7 +365,7 @@ page_test(char *argv,
 		newbrk = sbrk(-4096 * 16);	/* deallocate some space */
 		for (i = 0; i < 16; i++) {	/* now get it back in pieces */
 			newbrk = sbrk(4096);	/* Get pointer to new space */
-			if ((int)newbrk == -1) {
+			if (newbrk == (void *)-1L) {
 				perror("\npage_test");	/* tell more info */
 				fprintf(stderr,
 					"page_test: Unable to do sbrk.\n");
@@ -406,7 +406,7 @@ brk_test(char *argv,
 	 */
 	oldbrk = sbrk(0);		/* get old break value */
 	newbrk = sbrk(1024 * 1024);	/* move up 1 megabyte */
-	if ((int)newbrk == -1) {
+	if (newbrk == (void *)-1L) {
 		perror("\nbrk_test");	/* tell more info */
 		fprintf(stderr, "brk_test: Unable to do initial sbrk.\n");
 		return (-1);
@@ -419,7 +419,7 @@ brk_test(char *argv,
 		newbrk = sbrk(-4096 * 16);	/* deallocate some space */
 		for (i = 0; i < 16; i++) {	/* allocate it back */
 			newbrk = sbrk(4096);	/* 4k at a time (should be ~ 1 page) */
-			if ((int)newbrk == -1) {
+			if (newbrk == (void *)-1L) {
 				perror("\nbrk_test");	/* tell more info */
 				fprintf(stderr,
 					"brk_test: Unable to do sbrk.\n");
diff -urp aim9.orig/pipe_test.c aim9/pipe_test.c
--- aim9.orig/pipe_test.c	2002-04-22 15:25:16.000000000 -0700
+++ aim9/pipe_test.c	2005-07-11 10:21:19.000000000 -0700
@@ -493,8 +493,8 @@ readn(int fd,
 		buf += result;		/* update pointer */
 		if (--count <= 0) {
 			fprintf(stderr,
-				"\nMaximum iterations exceeded in readn(%d, %#x, %d)",
-				fd, (unsigned)buf, size);
+				"\nMaximum iterations exceeded in readn(%d, %p, %d)",
+				fd, buf, size);
 			return (-1);
 		}
 	}				/* and loop */
@@ -523,8 +523,8 @@ writen(int fd,
 		buf += result;		/* update pointer */
 		if (--count <= 0) {	/* handle too many loops */
 			fprintf(stderr,
-				"\nMaximum iterations exceeded in writen(%d, %#x, %d)",
-				fd, (unsigned)buf, size);
+				"\nMaximum iterations exceeded in writen(%d, %p, %d)",
+				fd, buf, size);
 			return (-1);
 		}
 	}				/* and loop */

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  8:44                         ` Zhang, Yanmin
@ 2008-05-08  9:21                           ` Ingo Molnar
  2008-05-08  9:29                             ` Ingo Molnar
  2008-05-08  9:30                             ` Zhang, Yanmin
  0 siblings, 2 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08  9:21 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton


* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> >           disk_cp /mnt/shm
> >   disk_cp (1): cannot open /mnt/shm/tmpa.common
> >   disk1.c: No such file or directory
> > 
> >   [.. etc. a large stream of them .. ]
> > 
> > system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its 
> > work files.

> My machine has 8GB. To simulate your environment, I reserve 6GB for 
> hugetlb, then reran the testing and didn't see any failure except: AIM 
> Multiuser Benchmark - Suite VII Run Beginning
> 
> Tasks    jobs/min  jti  jobs/min/task      real       cpu
>  2000create_shared_memory(): can't create semaphore, pausing...
> create_shared_memory(): can't create semaphore, pausing...

that failure message you got worries me - it indicates that your test 
ran out of IPC semaphores. You can fix it by upping the semaphore 
limits via:

   echo "500 32000 128 512" > /proc/sys/kernel/sem
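
The four fields are SEMMSL SEMMNS SEMOPM SEMMNI: semaphores per set,
semaphores system-wide, operations per semop() call and number of
semaphore sets. A minimal stand-alone probe of the failure mode -
illustration only, not part of AIM7, assuming create_shared_memory()
is hitting semget() returning ENOSPC:

	#include <stdio.h>
	#include <string.h>
	#include <errno.h>
	#include <sys/ipc.h>
	#include <sys/sem.h>

	int main(void)
	{
		int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

		if (id >= 0) {
			printf("created semaphore set %d\n", id);
			semctl(id, 0, IPC_RMID);	/* clean up again */
		} else if (errno == ENOSPC) {
			printf("semaphore limits exhausted - raise /proc/sys/kernel/sem\n");
		} else {
			printf("semget failed: %s\n", strerror(errno));
		}
		return 0;
	}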

could you check that you still get similar results with this limit 
fixed?

note that once i've fixed the semaphore limits it started running fine 
here. And i see zero idle time during the run on a quad core box.

here are my numbers:

  # on v2.6.26-rc1-166-gc0a1811

  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55851.4         93      208.4   793.6   0.4654   # BKL: sleep
  2000    55402.2         79      210.1   800.1   0.4617

  2000    55728.4         93      208.9   795.5   0.4644   # BKL: spin
  2000    55787.2         93      208.7   794.5   0.4649   #

so the results are the same within noise.

I'll also check this workload on an 8-way box to make sure it's OK on 
larger CPU counts too.

could you double-check your test?

plus a tty tidbit as well, during the test i saw a few of these:

 Warning: dev (tty1) tty->count(639) != #fd's(638) in release_dev
 Warning: dev (tty1) tty->count(462) != #fd's(463) in release_dev
 Warning: dev (tty1) tty->count(274) != #fd's(275) in release_dev
 Warning: dev (tty1) tty->count(4) != #fd's(3) in release_dev
 Warning: dev (tty1) tty->count(164) != #fd's(163) in release_dev

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  9:21                           ` Ingo Molnar
@ 2008-05-08  9:29                             ` Ingo Molnar
  2008-05-08  9:30                             ` Zhang, Yanmin
  1 sibling, 0 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08  9:29 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton


* Ingo Molnar <mingo@elte.hu> wrote:

> plus a tty tidbit as well, during the test i saw a few of these:
> 
>  Warning: dev (tty1) tty->count(639) != #fd's(638) in release_dev

false alarm there - these were due to the breakage in the hack-patch i 
used ...

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  9:21                           ` Ingo Molnar
  2008-05-08  9:29                             ` Ingo Molnar
@ 2008-05-08  9:30                             ` Zhang, Yanmin
  1 sibling, 0 replies; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-08  9:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton


On Thu, 2008-05-08 at 11:21 +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> 
> > >           disk_cp /mnt/shm
> > >   disk_cp (1): cannot open /mnt/shm/tmpa.common
> > >   disk1.c: No such file or directory
> > > 
> > >   [.. etc. a large stream of them .. ]
> > > 
> > > system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its 
> > > work files.
> 
> > My machine has 8GB. To simulate your environment, I reserve 6GB for 
> > hugetlb, then reran the testing and didn't see any failure except: AIM 
> > Multiuser Benchmark - Suite VII Run Beginning
> > 
> > Tasks    jobs/min  jti  jobs/min/task      real       cpu
> >  2000create_shared_memory(): can't create semaphore, pausing...
> > create_shared_memory(): can't create semaphore, pausing...
> 
> that failure message you got worries me - it indicates that your test 
> ran out of IPC semaphores. You can fix it by upping the semaphore 
> limits via:
> 
>    echo "500 32000 128 512" > /proc/sys/kernel/sem
A quick test showed it does work.

Thanks. I need to catch the shuttle bus, or I'll have to walk home for 2 hours if I miss it. :)

> 
> could you check that you still get similar results with this limit 
> fixed?
> 
> note that once i've fixed the semaphore limits it started running fine 
> here. And i see zero idle time during the run on a quad core box.
> 
> here are my numbers:
> 
>   # on v2.6.26-rc1-166-gc0a1811
> 
>   Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
>   2000    55851.4         93      208.4   793.6   0.4654   # BKL: sleep
>   2000    55402.2         79      210.1   800.1   0.4617
> 
>   2000    55728.4         93      208.9   795.5   0.4644   # BKL: spin
>   2000    55787.2         93      208.7   794.5   0.4649   #
> 
> so the results are the same within noise.
> 
> I'll also check this workload on an 8-way box to make sure it's OK on 
> larger CPU counts too.
> 
> could you double-check your test?
> 
> plus a tty tidbit as well, during the test i saw a few of these:
> 
>  Warning: dev (tty1) tty->count(639) != #fd's(638) in release_dev
>  Warning: dev (tty1) tty->count(462) != #fd's(463) in release_dev
>  Warning: dev (tty1) tty->count(274) != #fd's(275) in release_dev
>  Warning: dev (tty1) tty->count(4) != #fd's(3) in release_dev
>  Warning: dev (tty1) tty->count(164) != #fd's(163) in release_dev
> 
> 	Ingo


^ permalink raw reply	[flat|nested] 140+ messages in thread

* [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08  4:17                       ` Linus Torvalds
@ 2008-05-08 12:01                         ` Ingo Molnar
  2008-05-08 12:28                           ` Ingo Molnar
                                             ` (2 more replies)
  0 siblings, 3 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 12:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > > That said, "idle 0%" is easy when you spin. Do you also have 
> > > actual performance numbers?
> >
> > Yes. My conclusion is based on the actual numbers; 0% cpu idle is 
> > just the behavior it should show.
> 
> Thanks, that's all I wanted to verify.
> 
> I'll leave this overnight, and see if somebody has come up with some 
> smart and wonderful patch. And if not, I think I'll apply mine as 
> "known to fix a regression", and we can perhaps then improve on things 
> further from there.

hey, i happen to have such a smart and wonderful patch =B-)

i reproduced the AIM7 workload and can confirm Yanmin's findings that 
2.6.26-rc1 regresses over 2.6.25 - by over 67% here.

Looking at the workload i found and fixed what i believe to be the real 
bug causing the AIM7 regression: it was inefficient wakeup / scheduling 
/ locking behavior of the new generic semaphore code, causing suboptimal 
performance.

The problem comes from the following code. The new semaphore code does 
this on down():

        spin_lock_irqsave(&sem->lock, flags);
        if (likely(sem->count > 0))
                sem->count--;
        else
                __down(sem);
        spin_unlock_irqrestore(&sem->lock, flags);

and this on up():

        spin_lock_irqsave(&sem->lock, flags);
        if (likely(list_empty(&sem->wait_list)))
                sem->count++;
        else
                __up(sem);
        spin_unlock_irqrestore(&sem->lock, flags);

where __up() does:

        list_del(&waiter->list);
        waiter->up = 1;
        wake_up_process(waiter->task);

and where __down() does this in essence:

        list_add_tail(&waiter.list, &sem->wait_list);
        waiter.task = task;
        waiter.up = 0;
        for (;;) {
                [...]
                spin_unlock_irq(&sem->lock);
                timeout = schedule_timeout(timeout);
                spin_lock_irq(&sem->lock);
                if (waiter.up)
                        return 0;
        }

the fastpath looks good and obvious, but note the following property of 
the contended path: if there's a task on the ->wait_list, the up() of 
the current owner will "pass over" ownership to that waiting task, in a 
wake-one manner, via the waiter->up flag and by removing the waiter from 
the wait list.

That is all fine in principle, but as implemented in 
kernel/semaphore.c it also creates a nasty, hidden source of contention!

The contention comes from the following property of the new semaphore 
code: the new owner owns the semaphore exclusively, even if it is not 
running yet.

So if the old owner, even if just a few instructions later, does a 
down() [lock_kernel()] again, it will be blocked and will have to wait 
on the new owner to eventually be scheduled (possibly on another CPU)! 
Or if any other task gets to lock_kernel() sooner than the "new 
owner" gets scheduled, it will be blocked unnecessarily and for a very long 
time when there are 2000 tasks running.

I.e. the implementation of the new semaphore code does wake-one and 
lock ownership in a very restrictive way - it does not allow 
opportunistic re-locking of the lock at all and keeps the scheduler from 
picking task order intelligently.

This kind of scheduling, with 2000 AIM7 processes running, creates awful 
cross-scheduling between those 2000 tasks, causes reduced parallelism, a 
throttled runqueue length and a lot of idle time. With increasing number 
of CPUs it causes an exponentially worse behavior in AIM7, as the chance 
for a newly woken new-owner task to actually run anytime soon is less 
and less likely.

Note that it takes just a tiny bit of contention for the 'new-semaphore 
catastrophe' to happen: the wakeup latencies get added to whatever small 
contention there is, and quickly snowball out of control!
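
To make the handoff problem concrete, here is a hypothetical timeline
(illustration only; the task names are made up) of an owner task A and a
waiter task B on a busy box:

	/*
	 * 2.6.26-rc1 semaphores, strict handoff:
	 *
	 *   task A (on a CPU)                 task B (asleep in __down())
	 *   -----------------                 ---------------------------
	 *   up():
	 *     list_del(B), B->up = 1,
	 *     wake_up_process(B)              B now "owns" the lock, but with
	 *                                     2000 runnable tasks it may not
	 *                                     get a CPU for a long time
	 *   down() a few instructions later:
	 *     count is still 0, so A goes
	 *     to sleep behind B - even
	 *     though nobody is actually
	 *     inside the critical section
	 *
	 * With opportunistic locking, up() just bumps the count and wakes B;
	 * A's subsequent down() sees count > 0 and takes the lock, and B
	 * simply rechecks the count when it eventually runs.
	 */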

I believe Yanmin's findings and numbers support this analysis too.

The best fix for this problem is to use the same scheduling logic that 
the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and 
wanted because we do not want to over-schedule), but also allow 
opportunistic locking of the lock even if a wakee is already "in 
flight".

The patch below implements this new logic. With this patch applied the 
AIM7 regression is largely fixed on my quad testbox:

  # v2.6.25 vanilla:
  ..................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    56096.4         91      207.5   789.7   0.4675
  2000    55894.4         94      208.2   792.7   0.4658

  # v2.6.26-rc1-166-gc0a1811 vanilla:
  ...................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    33230.6         83      350.3   784.5   0.2769
  2000    31778.1         86      366.3   783.6   0.2648

  # v2.6.26-rc1-166-gc0a1811 + semaphore-speedup:
  ...............................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55707.1         92      209.0   795.6   0.4642
  2000    55704.4         96      209.0   796.0   0.4642

i.e. a 67% speedup. We are now back to within 1% of the v2.6.25 
performance levels and have zero idle time during the test, as expected.

Btw., interactivity also improved dramatically with the fix - for 
example console-switching became almost instantaneous during this 
workload (which after all is running 2000 tasks at once!); without the 
patch it was stuck for a minute at times.

I also ran Linus's spinlock-BKL patch as well:

  # v2.6.26-rc1-166-gc0a1811 + Linus-BKL-spinlock-patch:
  ......................................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55889.0         92      208.3   793.3   0.4657
  2000    55891.7         96      208.3   793.3   0.4658

it is about 0.3% faster - but note that that is within the general noise 
levels of this test. I'd expect Linus's spinlock-BKL patch to give some 
small speedup because the BKL acquire times are short and 2000 tasks 
running all at once really increases the context-switching cost and most 
BKL contentions are within the cost of context-switching.

But i believe the better solution for that is to remove BKL use from all 
hotpaths, not to hide some of its costs by reintroducing it as a 
spinlock. Reintroducing the spinlock based BKL would have other 
disadvantages as well: it could reintroduce per-CPU-ness assumptions in 
BKL-using code and other complications as well. It's also not a very 
realistic workload - with 2000 tasks running the system was barely 
serviceable.

I'd much rather make BKL costs more apparent and more visible - but 50% 
regression was of course too much. But 0.3% for a 2000-tasks workload, 
which is near the noise level ... is acceptable i think - especially as 
this discussion has now reinvigorated the remove-the-BKL discussions and 
patches.

Linus, we can do your spinlock-BKL patch too if you feel strongly about 
it, but i'd rather not - we fought so hard for the preemptible BKL :-)

The spinlock-based-BKL patch only worked around the real problem i 
believe, because it eliminated the use of the suboptimal new semaphore 
code: with spinlocks there's no scheduling at all, so the wakeup/locking 
bug of the new semaphore code did not apply. It was not about any 
fastpath overhead AFAICS. [we'd have seen that with the 
CONFIG_PREEMPT_BKL=y code as well, which has been the default setting 
since v2.6.8.]

There's another nice side-effect of this speedup patch, the new generic 
semaphore code got even smaller:

   text    data     bss     dec     hex filename
   1241       0       0    1241     4d9 semaphore.o.before
   1207       0       0    1207     4b7 semaphore.o.after

(because the waiter.up complication got removed.)

Longer-term we should look into using the mutex code for the generic 
semaphore code as well - but it's not easy due to legacies and it's 
outside of the scope of v2.6.26 and outside the scope of this patch as 
well.

Hm?

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c |   36 ++++++++++++++++--------------------
 1 file changed, 16 insertions(+), 20 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		__down(sem);
+	sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
 EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore 
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		result = __down_interruptible(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		result = __down_killable(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem, 
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		result = __down_timeout(sem, jiffies);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -179,9 +178,8 @@ void up(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(list_empty(&sem->wait_list)))
-		sem->count++;
-	else
+	sem->count++;
+	if (unlikely(!list_empty(&sem->wait_list)))
 		__up(sem);
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up);
 struct semaphore_waiter {
 	struct list_head list;
 	struct task_struct *task;
-	int up;
 };
 
 /*
@@ -206,11 +203,11 @@ static inline int __sched __down_common(
 	struct task_struct *task = current;
 	struct semaphore_waiter waiter;
 
-	list_add_tail(&waiter.list, &sem->wait_list);
 	waiter.task = task;
-	waiter.up = 0;
 
 	for (;;) {
+		list_add_tail(&waiter.list, &sem->wait_list);
+
 		if (state == TASK_INTERRUPTIBLE && signal_pending(task))
 			goto interrupted;
 		if (state == TASK_KILLABLE && fatal_signal_pending(task))
@@ -221,7 +218,7 @@ static inline int __sched __down_common(
 		spin_unlock_irq(&sem->lock);
 		timeout = schedule_timeout(timeout);
 		spin_lock_irq(&sem->lock);
-		if (waiter.up)
+		if (sem->count > 0)
 			return 0;
 	}
 
@@ -259,6 +256,5 @@ static noinline void __sched __up(struct
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 						struct semaphore_waiter, list);
 	list_del(&waiter->list);
-	waiter->up = 1;
 	wake_up_process(waiter->task);
 }

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 12:01                         ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Ingo Molnar
@ 2008-05-08 12:28                           ` Ingo Molnar
  2008-05-08 14:43                             ` Ingo Molnar
  2008-05-08 16:02                             ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Linus Torvalds
  2008-05-08 13:20                           ` Matthew Wilcox
  2008-05-08 13:56                           ` Arjan van de Ven
  2 siblings, 2 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 12:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Ingo Molnar <mingo@elte.hu> wrote:

> +	if (unlikely(sem->count <= 0))
>  		__down(sem);
> +	sem->count--;

Peter pointed out that because sem->count is u32, the <= 0 is in fact 
a "== 0" condition - the patch below does that. As expected gcc figured 
out the same thing too so the resulting code output did not change. (so 
this is just a cleanup)

i've got this lined up in sched.git and it's undergoing testing right 
now. If that testing goes fine and if there are no objections i'll send 
a pull request for it later today.

	Ingo

---------------->
Subject: semaphores: improve code
From: Ingo Molnar <mingo@elte.hu>
Date: Thu May 08 14:19:23 CEST 2008

No code changed:

kernel/semaphore.o:

   text	   data	    bss	    dec	    hex	filename
   1207	      0	      0	   1207	    4b7	semaphore.o.before
   1207	      0	      0	   1207	    4b7	semaphore.o.after

md5:
   c10198c2952bd345a1edaac6db891548  semaphore.o.before.asm
   c10198c2952bd345a1edaac6db891548  semaphore.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -54,7 +54,7 @@ void down(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (unlikely(sem->count <= 0))
+	if (unlikely(!sem->count))
 		__down(sem);
 	sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
@@ -76,7 +76,7 @@ int down_interruptible(struct semaphore 
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (unlikely(sem->count <= 0))
+	if (unlikely(!sem->count))
 		result = __down_interruptible(sem);
 	if (!result)
 		sem->count--;
@@ -102,7 +102,7 @@ int down_killable(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (unlikely(sem->count <= 0))
+	if (unlikely(!sem->count))
 		result = __down_killable(sem);
 	if (!result)
 		sem->count--;
@@ -156,7 +156,7 @@ int down_timeout(struct semaphore *sem, 
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (unlikely(sem->count <= 0))
+	if (unlikely(!sem->count))
 		result = __down_timeout(sem, jiffies);
 	if (!result)
 		sem->count--;

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 12:01                         ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Ingo Molnar
  2008-05-08 12:28                           ` Ingo Molnar
@ 2008-05-08 13:20                           ` Matthew Wilcox
  2008-05-08 15:01                             ` Ingo Molnar
  2008-05-08 13:56                           ` Arjan van de Ven
  2 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-08 13:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Thu, May 08, 2008 at 02:01:30PM +0200, Ingo Molnar wrote:
> Looking at the workload i found and fixed what i believe to be the real 
> bug causing the AIM7 regression: it was inefficient wakeup / scheduling 
> / locking behavior of the new generic semaphore code, causing suboptimal 
> performance.

I did note that earlier downthread ... although to be fair, I thought of
it in terms of three tasks with the third task coming in and stealing
the second task's wakeup rather than the first task starving the second
by repeatedly locking/unlocking the semaphore.

> So if the old owner, even if just a few instructions later, does a 
> down() [lock_kernel()] again, it will be blocked and will have to wait 
> on the new owner to eventually be scheduled (possibly on another CPU)! 
> Or if any other task gets to lock_kernel() sooner than the "new 
> owner" gets scheduled, it will be blocked unnecessarily and for a very long 
> time when there are 2000 tasks running.
> 
> I.e. the implementation of the new semaphore code does wake-one and 
> lock ownership in a very restrictive way - it does not allow 
> opportunistic re-locking of the lock at all and keeps the scheduler from 
> picking task order intelligently.

Fair is certainly the enemy of throughput (see also dbench arguments
passim).  It may be that some semaphore users really do want fairness --
it seems pretty clear that we don't want fairness for the BKL.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 12:01                         ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Ingo Molnar
  2008-05-08 12:28                           ` Ingo Molnar
  2008-05-08 13:20                           ` Matthew Wilcox
@ 2008-05-08 13:56                           ` Arjan van de Ven
  2 siblings, 0 replies; 140+ messages in thread
From: Arjan van de Ven @ 2008-05-08 13:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Thu, 8 May 2008 14:01:30 +0200
Ingo Molnar <mingo@elte.hu> wrote:
> 
> The contention comes from the following property of the new semaphore 
> code: the new owner owns the semaphore exclusively, even if it is not 
> running yet.
> 
> So if the old owner, even if just a few instructions later, does a 
> down() [lock_kernel()] again, it will be blocked and will have to
> wait on the new owner to eventually be scheduled (possibly on another
> CPU)! Or if another other task gets to lock_kernel() sooner than the
> "new owner" scheduled, it will be blocked unnecessarily and for a
> very long time when there are 2000 tasks running.

ok sounds like I like the fairness part of the new semaphores (but
obviously not the 67% performance downside; I'd expect to sacrifice a
little performance.. but this much??).

It sucks though; if this were a mutex, we could wake up the owner of
the bugger in the contended acquire path synchronously.... but these 
are semaphores, and don't have an owner ;( bah bah bah
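
Simplified sketches of that structural point - these are illustrative,
not the real <linux/mutex.h> / <linux/semaphore.h> definitions: a mutex
conceptually has one current holder you could wake or hand off to
synchronously, while a counting semaphore only has a count and an
anonymous wait list:

	struct mutex_like {
		atomic_t		count;		/* 1: unlocked, 0: locked */
		spinlock_t		wait_lock;
		struct list_head	wait_list;
		struct task_struct	*owner;		/* the single current holder */
	};

	struct semaphore_like {
		spinlock_t		lock;
		unsigned int		count;		/* any value >= 0 */
		struct list_head	wait_list;	/* no notion of "the holder" */
	};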

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 12:28                           ` Ingo Molnar
@ 2008-05-08 14:43                             ` Ingo Molnar
  2008-05-08 15:10                               ` [git pull] scheduler fixes Ingo Molnar
  2008-05-08 16:02                             ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Linus Torvalds
  1 sibling, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 14:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Ingo Molnar <mingo@elte.hu> wrote:

> Peter pointed out that because sem->count is u32, the <= 0 is in 
> fact a "== 0" condition - the patch below does that. As expected gcc 
> figured out the same thing too so the resulting code output did not 
> change. (so this is just a cleanup)

a second update patch, i've further simplified the semaphore wakeup 
logic: there's no need for the wakeup to remove the task from the wait 
list. This will make them slightly more fair, but more importantly, 
this closes a race in my first patch for the unlikely case of a signal 
(or a timeout) and an unlock coming in at the same time and the task not 
getting removed from the wait-list.

( my performance testing with 2000 AIM7 tasks on a quad never hit that
  race but x86.git QA actually triggered it after about 30 random kernel
  bootups and it caused a nasty crash and lockup. )

	Ingo

---------------->
Subject: sem: simplify queue management
From: Ingo Molnar <mingo@elte.hu>
Date: Tue May 06 19:32:42 CEST 2008

kernel/semaphore.o:

   text	   data	    bss	    dec	    hex	filename
   1040	      0	      0	   1040	    410	semaphore.o.before
    975	      0	      0	    975	    3cf	semaphore.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c |   32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -202,33 +202,34 @@ static inline int __sched __down_common(
 {
 	struct task_struct *task = current;
 	struct semaphore_waiter waiter;
+	int ret = 0;
 
 	waiter.task = task;
+	list_add_tail(&waiter.list, &sem->wait_list);
 
 	for (;;) {
-		list_add_tail(&waiter.list, &sem->wait_list);
-
-		if (state == TASK_INTERRUPTIBLE && signal_pending(task))
-			goto interrupted;
-		if (state == TASK_KILLABLE && fatal_signal_pending(task))
-			goto interrupted;
-		if (timeout <= 0)
-			goto timed_out;
+		if (state == TASK_INTERRUPTIBLE && signal_pending(task)) {
+			ret = -EINTR;
+			break;
+		}
+		if (state == TASK_KILLABLE && fatal_signal_pending(task)) {
+			ret = -EINTR;
+			break;
+		}
+		if (timeout <= 0) {
+			ret = -ETIME;
+			break;
+		}
 		__set_task_state(task, state);
 		spin_unlock_irq(&sem->lock);
 		timeout = schedule_timeout(timeout);
 		spin_lock_irq(&sem->lock);
 		if (sem->count > 0)
-			return 0;
+			break;
 	}
 
- timed_out:
-	list_del(&waiter.list);
-	return -ETIME;
-
- interrupted:
 	list_del(&waiter.list);
-	return -EINTR;
+	return ret;
 }
 
 static noinline void __sched __down(struct semaphore *sem)
@@ -255,6 +256,5 @@ static noinline void __sched __up(struct
 {
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 						struct semaphore_waiter, list);
-	list_del(&waiter->list);
 	wake_up_process(waiter->task);
 }

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  4:37           ` Zhang, Yanmin
@ 2008-05-08 14:58             ` Linus Torvalds
  0 siblings, 0 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08 14:58 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML,
	Alexander Viro, Andrew Morton, linux-fsdevel



On Thu, 8 May 2008, Zhang, Yanmin wrote:
>
> I collected oprofile data. It doesn't look very useful, as cpu idle is more than 50%.

Ahh, so it's probably still the BKL that is the problem, it's just not in 
the file locking code. The changes to fs/locks.c probably didn't matter 
all that much, and the additional regression was likely just some 
perturbation.

So it's probably fasync that AIM7 tests. Quite possibly coupled with 
/dev/tty etc. No file locking.

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 13:20                           ` Matthew Wilcox
@ 2008-05-08 15:01                             ` Ingo Molnar
  0 siblings, 0 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 15:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra


* Matthew Wilcox <matthew@wil.cx> wrote:

> Fair is certainly the enemy of throughput (see also dbench arguments 
> passim).  It may be that some semaphore users really do want fairness 
> -- it seems pretty clear that we don't want fairness for the BKL.

i don't think we need to consider any theoretical arguments about 
fairness here as there's a fundamental down-to-earth maintenance issue 
that governs: old semaphores were similarly unfair too, so it is just a 
bad idea (and a bug) to change behavior when implementing new, generic 
semaphores that are supposed to be a seamless replacement! This is about 
legacy code that is intended to be phased out anyway.

This is already a killer argument and we wouldn't have to look any 
further.

but even on the more theoretical level i disagree: fairness of CPU time 
is something that is implemented by the scheduler in a natural way 
already. Putting extra ad-hoc synchronization and scheduling into the 
locking primitives around data structures only gives mathematical 
fairness and artificial micro-scheduling; it does not actually make the 
end result more useful! This is especially true for the BKL which is 
auto-dropped by the scheduler anyway. (so descheduling a task will 
automatically relieve it of its BKL ownership)

For example we've invested a _lot_ of time and effort into adding lock 
stealing (i.e. intentional "unfairness") to kernel/rtmutex.c. Which is a 
_lot_ harder to do atomically with PI constraints but still possible and 
makes sense in the grand scheme of things. kernel/mutex.c is also 
"unfair" - and that's correct IMO.

For the BKL in particular there's almost no sense in talking about any 
underlying resource, and there's almost no expectation from users for 
that imaginary resource to be shared fairly.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [git pull] scheduler fixes
  2008-05-08 14:43                             ` Ingo Molnar
@ 2008-05-08 15:10                               ` Ingo Molnar
  2008-05-08 15:33                                 ` Adrian Bunk
  2008-05-11 11:03                                 ` Matthew Wilcox
  0 siblings, 2 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 15:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Ingo Molnar <mingo@elte.hu> wrote:

> > Peter pointed out that because sem->count is u32, the <= 0 is in 
> > fact a "== 0" condition - the patch below does that. As expected gcc 
> > figured out the same thing too so the resulting code output did not 
> > change. (so this is just a cleanup)
> 
> a second update patch, i've further simplified the semaphore wakeup 
> logic: there's no need for the wakeup to remove the task from the wait 
> list. This will make them slightly more fair, but more 
> importantly, this closes a race in my first patch for the unlikely 
> case of a signal (or a timeout) and an unlock coming in at the same 
> time and the task not getting removed from the wait-list.
> 
> ( my performance testing with 2000 AIM7 tasks on a quad never hit that
>   race but x86.git QA actually triggered it after about 30 random 
>   kernel bootups and it caused a nasty crash and lockup. )

ok, it's looking good here so far so here's the scheduler fixes tree 
that you can pull if my semaphore fix looks good to you too:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes.git for-linus

also includes a scheduler arithmetic fix from Mike. Find the shortlog 
and diff below.

	Ingo

------------------>
Ingo Molnar (1):
      semaphore: fix

Mike Galbraith (1):
      sched: fix weight calculations

 kernel/sched_fair.c |   11 ++++++--
 kernel/semaphore.c  |   64 ++++++++++++++++++++++++---------------------------
 2 files changed, 38 insertions(+), 37 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c863663..e24ecd3 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -662,10 +662,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	if (!initial) {
 		/* sleeps upto a single latency don't count. */
 		if (sched_feat(NEW_FAIR_SLEEPERS)) {
+			unsigned long thresh = sysctl_sched_latency;
+
+			/*
+			 * convert the sleeper threshold into virtual time
+			 */
 			if (sched_feat(NORMALIZED_SLEEPER))
-				vruntime -= calc_delta_weight(sysctl_sched_latency, se);
-			else
-				vruntime -= sysctl_sched_latency;
+				thresh = calc_delta_fair(thresh, se);
+
+			vruntime -= thresh;
 		}
 
 		/* ensure we never gain time by being placed backwards. */
diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5c2942e..5e41217 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(!sem->count))
 		__down(sem);
+	sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
 EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(!sem->count))
 		result = __down_interruptible(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(!sem->count))
 		result = __down_killable(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem, long jiffies)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(!sem->count))
 		result = __down_timeout(sem, jiffies);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -179,9 +178,8 @@ void up(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(list_empty(&sem->wait_list)))
-		sem->count++;
-	else
+	sem->count++;
+	if (unlikely(!list_empty(&sem->wait_list)))
 		__up(sem);
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up);
 struct semaphore_waiter {
 	struct list_head list;
 	struct task_struct *task;
-	int up;
 };
 
 /*
@@ -205,33 +202,34 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
 {
 	struct task_struct *task = current;
 	struct semaphore_waiter waiter;
+	int ret = 0;
 
-	list_add_tail(&waiter.list, &sem->wait_list);
 	waiter.task = task;
-	waiter.up = 0;
+	list_add_tail(&waiter.list, &sem->wait_list);
 
 	for (;;) {
-		if (state == TASK_INTERRUPTIBLE && signal_pending(task))
-			goto interrupted;
-		if (state == TASK_KILLABLE && fatal_signal_pending(task))
-			goto interrupted;
-		if (timeout <= 0)
-			goto timed_out;
+		if (state == TASK_INTERRUPTIBLE && signal_pending(task)) {
+			ret = -EINTR;
+			break;
+		}
+		if (state == TASK_KILLABLE && fatal_signal_pending(task)) {
+			ret = -EINTR;
+			break;
+		}
+		if (timeout <= 0) {
+			ret = -ETIME;
+			break;
+		}
 		__set_task_state(task, state);
 		spin_unlock_irq(&sem->lock);
 		timeout = schedule_timeout(timeout);
 		spin_lock_irq(&sem->lock);
-		if (waiter.up)
-			return 0;
+		if (sem->count > 0)
+			break;
 	}
 
- timed_out:
-	list_del(&waiter.list);
-	return -ETIME;
-
- interrupted:
 	list_del(&waiter.list);
-	return -EINTR;
+	return ret;
 }
 
 static noinline void __sched __down(struct semaphore *sem)
@@ -258,7 +256,5 @@ static noinline void __sched __up(struct semaphore *sem)
 {
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 						struct semaphore_waiter, list);
-	list_del(&waiter->list);
-	waiter->up = 1;
 	wake_up_process(waiter->task);
 }

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-08 15:10                               ` [git pull] scheduler fixes Ingo Molnar
@ 2008-05-08 15:33                                 ` Adrian Bunk
  2008-05-08 15:41                                   ` Ingo Molnar
  2008-05-11 11:03                                 ` Matthew Wilcox
  1 sibling, 1 reply; 140+ messages in thread
From: Adrian Bunk @ 2008-05-08 15:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Mike Galbraith

On Thu, May 08, 2008 at 05:10:28PM +0200, Ingo Molnar wrote:
>...
> also includes a scheduler arithmetic fix from Mike. Find the shortlog 
> and diff below.
> 
> 	Ingo
> 
> ------------------>
>...
> Mike Galbraith (1):
>       sched: fix weight calculations
>...

The commit description says:

<--  snip  -->

...

This bug could be related to the regression reported by Yanmin Zhang:

| Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots
| of regressions with 2.6.26-rc1:
|
| 1) 8-core stoakley: 28%;
| 2) 16-core tigerton: 20%;
| 3) Itanium Montvale: 50%.

Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
...

<--  snip  -->

Can we get that verified and the description updated before it hits 
Linus' tree?

Otherwise this "could be related" will become unchangeable metadata that 
will stay forever - no matter whether there's any relation at all.

Thanks
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-08 15:33                                 ` Adrian Bunk
@ 2008-05-08 15:41                                   ` Ingo Molnar
  2008-05-08 19:42                                     ` Adrian Bunk
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 15:41 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Mike Galbraith


* Adrian Bunk <bunk@kernel.org> wrote:

> Can we get that verified and the description updated before it hits 
> Linus' tree?

that's not needed. Mike's fix is correct, regardless of whether it fixes 
the other regression or not.

> Otherwise this "could be related" will become unchangable metadata 
> that will stay forever - no matter whether there's any relation at 
> all.

... and the problem with that is exactly what?

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 12:28                           ` Ingo Molnar
  2008-05-08 14:43                             ` Ingo Molnar
@ 2008-05-08 16:02                             ` Linus Torvalds
  2008-05-08 18:30                               ` Linus Torvalds
  1 sibling, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08 16:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin



On Thu, 8 May 2008, Ingo Molnar wrote:
> 
> Peter pointed out that because sem->count is u32, the <= 0 is in fact 
> a "== 0" condition - the patch below does that. As expected gcc figured 
> out the same thing too so the resulting code output did not change. (so 
> this is just a cleanup)

Why don't we just make it do the same thing that the x86 semaphores used 
to do: make it signed, and decrement unconditionally. And call the 
slow-path if it became negative.

IOW, make the fast-path be

	spin_lock_irqsave(&sem->lock, flags);
	if (--sem->count < 0)
		__down(sem);
	spin_unlock_irqrestore(&sem->lock, flags);

and now we have an existing known-good implementation to look at?

Rather than making up a totally new and untested thing.
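
(For completeness, the matching up() fast-path under that scheme would 
presumably look something like this - assuming 'count' becomes a signed 
int, and with __up() still doing the actual wakeup under the lock:)

	spin_lock_irqsave(&sem->lock, flags);
	if (unlikely(++sem->count <= 0))
		__up(sem);	/* somebody went negative in down(): wake them */
	spin_unlock_irqrestore(&sem->lock, flags);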

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 16:02                             ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Linus Torvalds
@ 2008-05-08 18:30                               ` Linus Torvalds
  2008-05-08 20:19                                 ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08 18:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin



On Thu, 8 May 2008, Linus Torvalds wrote:
> 
> Why don't we just make it do the same thing that the x86 semaphores used 
> to do: make it signed, and decrement unconditionally. And call the 
> slow-path if it became negative.
> ... 
> and now we have an existing known-good implementation to look at?

Ok, after having thought that over, and looked at the code, I think I like 
your version after all. The old implementation was pretty complex due to 
the need to be so extra careful about the count that could change outside 
of the lock, so everything considered, a new implementation that is 
simpler is probably the better choice.

Ergo, I will just pull your scheduler tree.

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-08 15:41                                   ` Ingo Molnar
@ 2008-05-08 19:42                                     ` Adrian Bunk
  0 siblings, 0 replies; 140+ messages in thread
From: Adrian Bunk @ 2008-05-08 19:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Mike Galbraith

On Thu, May 08, 2008 at 05:41:16PM +0200, Ingo Molnar wrote:
> 
> * Adrian Bunk <bunk@kernel.org> wrote:
> 
> > Can we get that verified and the description updated before it hits 
> > Linus' tree?
> 
> that's not needed. Mike's fix is correct, regardless of whether it fixes 
> the other regression or not.

Then scrap the part about it possibly fixing a regression and the
Reported-by: line.

> > Otherwise this "could be related" will become unchangable metadata 
> > that will stay forever - no matter whether there's any relation at 
> > all.
> 
> ... and the problem with that is exactly what?

It is important that our metadata is as complete and correct as 
reasonably possible. Our code is not as well documented as it should be, 
and in my experience often the only way to understand what happens and 
why it happens is to ask git for the metadata (and I'm actually doing 
this even for most of my "trivial" patches).

In 3 hours or 3 years someone might look at this commit trying to 
understand what it does and why it does this.

And there's a big difference between "we do it because it's correct from 
a theoretical point of view" and "it is supposed to fix a huge 
performance regression".

> 	Ingo

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 18:30                               ` Linus Torvalds
@ 2008-05-08 20:19                                 ` Ingo Molnar
  2008-05-08 20:27                                   ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 20:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 8 May 2008, Linus Torvalds wrote:
> > 
> > Why don't we just make it do the same thing that the x86 semaphores used 
> > to do: make it signed, and decrement unconditionally. And call the 
> > slow-path if it became negative.
> > ... 
> > and now we have an existing known-good implementation to look at?
> 
> Ok, after having thought that over, and looked at the code, I think I 
> like your version after all. The old implementation was pretty complex 
> due to the need to be so extra careful about the count that could 
> change outside of the lock, so everything considered, a new 
> implementation that is simpler is probably the better choice.

yeah, i thought about that too; the problem i found is this thing in the 
old lib/semaphore-sleepers.c code's __down() path:

        remove_wait_queue_locked(&sem->wait, &wait);
        wake_up_locked(&sem->wait);
        spin_unlock_irqrestore(&sem->wait.lock, flags);
        tsk->state = TASK_RUNNING;

i once understood that mystery wakeup to be necessary for some weird 
ordering reason, but it would probably be hard to justify in the new 
code, because it's done unconditionally, regardless of whether there are 
sleepers around.

And once we deviate from the old code, we might as well go for the 
simplest approach - which also happens to be rather close to the mutex 
code's current slowpath - just with counting property added, legacy 
semantics and no lockdep coverage.

> Ergo, I will just pull your scheduler tree.

great! Meanwhile 100 randconfigs booted fine with that tree, so i'd say 
the implementation is robust.

i also did a quick re-test of AIM7 because the wakeup logic changed a 
bit from what i tested initially (from round-robin to strict FIFO), and 
as expected not much changed in the AIM7 results on the quad:

  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55019.9         96      211.6   806.5   0.4585
  2000    55116.2         90      211.2   804.7   0.4593
  2000    55082.3         82      211.3   805.5   0.4590

this is slightly lower, but the comparison was not fully apples to 
apples because this run also had some tracing active, among other small 
differences. It's still very close to the v2.6.25 numbers. I suspect 
some more performance could be won in this particular workload by 
getting rid of the BKL dependency altogether.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 20:19                                 ` Ingo Molnar
@ 2008-05-08 20:27                                   ` Linus Torvalds
  2008-05-08 21:45                                     ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08 20:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin



On Thu, 8 May 2008, Ingo Molnar wrote:
>
> I suspect some more performance could be won in this particular workload 
> by getting rid of the BKL dependency altogether.

Did somebody have any trace on which BKL taker it is? It apparently wasn't 
file locking. Was it the tty layer?

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 20:27                                   ` Linus Torvalds
@ 2008-05-08 21:45                                     ` Ingo Molnar
  2008-05-08 22:02                                       ` Ingo Molnar
  2008-05-08 22:55                                       ` Linus Torvalds
  0 siblings, 2 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 21:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > I suspect some more performance could be won in this particular 
> > workload by getting rid of the BKL dependency altogether.
> 
> Did somebody have any trace on which BKL taker it is? It apparently 
> wasn't file locking. Was it the tty layer?

yeah, i captured a trace of all the down()s that happen in the workload, 
using ftrace/sched_switch + stacktrace + a tracepoint in down(). Here 
are the semaphore activities in the trace:

# grep lock_kernel /debug/tracing/trace | cut -d: -f2- | sort | \
                                          uniq -c | sort -n | cut -d= -f1-4

      9  down <= lock_kernel <= proc_lookup_de <= proc_lookup <
     12  down <= lock_kernel <= de_put <= proc_delete_inode <
     14  down <= lock_kernel <= proc_lookup_de <= proc_lookup <
     19  down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
     58  down <= lock_kernel <= tty_release <= __fput <
     62  down <= lock_kernel <= chrdev_open <= __dentry_open <
     70  down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
   2512  down <= lock_kernel <= opost <= write_chan <
   2574  down <= lock_kernel <= write_chan <= tty_write <

note that this is running the fixed semaphore code, so contended 
semaphores are really rare in the trace. The histogram above includes 
all calls to down().

here's the full trace file (from a single CPU, on a dual-core box 
running aim7):

   http://redhat.com/~mingo/misc/aim7-ftrace.txt

some other interesting stats. Top 5 wakeup sources:

 # grep wake_up aim7-ftrace.txt  | cut -d: -f2- | sort | uniq -c | 
                                   sort -n | cut -d= -f1-6 | tail -5
  [...]
    340  default_wake_function <= __wake_up_common <= __wake_up_sync <= unix_write_space <= sock_wfree <= skb_release_all <
    411  default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up_sync <= pipe_read <= do_sync_read <
    924  default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up_sync <= pipe_write <= do_sync_write <
   1301  default_wake_function <= __wake_up_common <= __wake_up <= n_tty_receive_buf <= pty_write <= write_chan <
   2065  wake_up_state <= prepare_signal <= send_signal <= __group_send_sig_info <= group_send_sig_info <= __kill_pgrp_info <

top 10 scheduling points:

# grep -w schedule aim7-ftrace.txt  | cut -d: -f2- | sort |
                            uniq -c | sort -n | cut -d= -f1-4 | tail -10
  [...]
    465  schedule <= __cond_resched <= _cond_resched <= shrink_active_list <
    582  schedule <= cpu_idle <= start_secondary <=  <
    929  schedule <= pipe_wait <= pipe_read <= do_sync_read <
    990  schedule <= __cond_resched <= _cond_resched <= shrink_page_list <
   1034  schedule <= io_schedule <= get_request_wait <= __make_request <
   1140  schedule <= worker_thread <= kthread <= child_rip <
   1512  schedule <= retint_careful <=  <= 0 <
   1571  schedule <= __cond_resched <= _cond_resched <= shrink_active_list <
   2034  schedule <= schedule_timeout <= do_select <= core_sys_select <
   2355  schedule <= sysret_careful <=  <= 0 <

as visible from the trace, this is a CONFIG_PREEMPT_NONE kernel, so most 
of the preemptions get triggered by wakeups and get executed from the 
return-from-syscall need_resched check. But there's also a fair amount 
of select()-related sleeping, and a fair number of preemptions driven by 
IRQs hitting userspace.

a rather hectic workload, with a surprisingly large amount of TTY 
related activity.

and here's a few seconds worth of NMI driven readprofile output:

  1131 page_fault                                70.6875
  1181 find_lock_page                             8.2014
  1302 __isolate_lru_page                        12.8911
  1344 copy_page_c                               84.0000
  1588 page_lock_anon_vma                        34.5217
  1616 ext3fs_dirhash                             3.4753
  1683 ext3_htree_store_dirent                    6.8975
  1976 str2hashbuf                               12.7484
  1992 copy_user_generic_string                  31.1250
  2362 do_unlinkat                                6.5069
  2969 try_to_unmap                               2.0791
  3031 will_become_orphaned_pgrp                 24.4435
  4009 __copy_user_nocache                       12.3735
  4627 congestion_wait                           34.0221
  6624 clear_page_c                             414.0000
 18711 try_to_unmap_one                          30.8254
 34447 page_referenced                          140.0285
 38669 do_filp_open                              17.7789
166569 page_referenced_one                      886.0053
213361 *unknown*
216021 sync_page                                3375.3281
391888 page_check_address                       1414.7581
962212 total                                      0.3039

system overhead is consistently 20% during this test.

the page_check_address() overhead is surprising - tons of rmap 
contention? about 10% wall-clock overhead in that function alone - and 
this is just on a dual-core box!

Below is the instruction level profile of that function. The second 
column is the # of profile hits. The spin_lock() overhead is clearly 
visible.

	Ingo

                         ___----- # of profile hits
                        v
ffffffff80286244:     3478 <page_check_address>:
ffffffff80286244:     3478 	55                   	push   %rbp
ffffffff80286245:     1092 	48 89 d0             	mov    %rdx,%rax
ffffffff80286248:        0 	48 c1 e8 27          	shr    $0x27,%rax
ffffffff8028624c:        0 	48 89 e5             	mov    %rsp,%rbp
ffffffff8028624f:      717 	41 57                	push   %r15
ffffffff80286251:        1 	25 ff 01 00 00       	and    $0x1ff,%eax
ffffffff80286256:        0 	49 89 cf             	mov    %rcx,%r15
ffffffff80286259:     1354 	41 56                	push   %r14
ffffffff8028625b:        0 	49 89 fe             	mov    %rdi,%r14
ffffffff8028625e:        0 	48 89 d7             	mov    %rdx,%rdi
ffffffff80286261:     1507 	41 55                	push   %r13
ffffffff80286263:        0 	41 54                	push   %r12
ffffffff80286265:        0 	53                   	push   %rbx
ffffffff80286266:     1763 	48 83 ec 08          	sub    $0x8,%rsp
ffffffff8028626a:        0 	48 8b 56 48          	mov    0x48(%rsi),%rdx
ffffffff8028626e:        4 	48 8b 14 c2          	mov    (%rdx,%rax,8),%rdx
ffffffff80286272:     3174 	f6 c2 01             	test   $0x1,%dl
ffffffff80286275:        0 	0f 84 cc 00 00 00    	je     ffffffff80286347 <page_check_address+0x103>
ffffffff8028627b:        0 	48 89 f8             	mov    %rdi,%rax
ffffffff8028627e:    64468 	48 be 00 f0 ff ff ff 	mov    $0x3ffffffff000,%rsi
ffffffff80286285:        0 	3f 00 00 
ffffffff80286288:        1 	48 b9 00 00 00 00 00 	mov    $0xffff810000000000,%rcx
ffffffff8028628f:        0 	81 ff ff 
ffffffff80286292:     4686 	48 c1 e8 1b          	shr    $0x1b,%rax
ffffffff80286296:        0 	48 21 f2             	and    %rsi,%rdx
ffffffff80286299:        0 	25 f8 0f 00 00       	and    $0xff8,%eax
ffffffff8028629e:     7468 	48 01 d0             	add    %rdx,%rax
ffffffff802862a1:        0 	48 8b 14 08          	mov    (%rax,%rcx,1),%rdx
ffffffff802862a5:       11 	f6 c2 01             	test   $0x1,%dl
ffffffff802862a8:     4409 	0f 84 99 00 00 00    	je     ffffffff80286347 <page_check_address+0x103>
ffffffff802862ae:        0 	48 89 f8             	mov    %rdi,%rax
ffffffff802862b1:        0 	48 21 f2             	and    %rsi,%rdx
ffffffff802862b4:     1467 	48 c1 e8 12          	shr    $0x12,%rax
ffffffff802862b8:        0 	25 f8 0f 00 00       	and    $0xff8,%eax
ffffffff802862bd:        0 	48 01 d0             	add    %rdx,%rax
ffffffff802862c0:      944 	48 8b 14 08          	mov    (%rax,%rcx,1),%rdx
ffffffff802862c4:       17 	f6 c2 01             	test   $0x1,%dl
ffffffff802862c7:        1 	74 7e                	je     ffffffff80286347 <page_check_address+0x103>
ffffffff802862c9:      927 	48 89 d0             	mov    %rdx,%rax
ffffffff802862cc:       77 	48 c1 ef 09          	shr    $0x9,%rdi
ffffffff802862d0:        3 	48 21 f0             	and    %rsi,%rax
ffffffff802862d3:     1238 	81 e7 f8 0f 00 00    	and    $0xff8,%edi
ffffffff802862d9:        1 	48 01 c8             	add    %rcx,%rax
ffffffff802862dc:        0 	4c 8d 24 38          	lea    (%rax,%rdi,1),%r12
ffffffff802862e0:      792 	41 f6 04 24 81       	testb  $0x81,(%r12)
ffffffff802862e5:       25 	74 60                	je     ffffffff80286347 <page_check_address+0x103>
ffffffff802862e7:   118074 	48 c1 ea 0c          	shr    $0xc,%rdx
ffffffff802862eb:    41187 	48 b8 00 00 00 00 00 	mov    $0xffffe20000000000,%rax
ffffffff802862f2:        0 	e2 ff ff 
ffffffff802862f5:      182 	48 6b d2 38          	imul   $0x38,%rdx,%rdx
ffffffff802862f9:    25998 	48 8d 1c 02          	lea    (%rdx,%rax,1),%rbx
ffffffff802862fd:        0 	4c 8d 6b 10          	lea    0x10(%rbx),%r13
ffffffff80286301:        0 	4c 89 ef             	mov    %r13,%rdi
ffffffff80286304:    80598 	e8 4b 17 28 00       	callq  ffffffff80507a54 <_spin_lock>
ffffffff80286309:    36022 	49 8b 0c 24          	mov    (%r12),%rcx
ffffffff8028630d:     1623 	f6 c1 81             	test   $0x81,%cl
ffffffff80286310:        5 	74 32                	je     ffffffff80286344 <page_check_address+0x100>
ffffffff80286312:       16 	48 b8 00 00 00 00 00 	mov    $0x1e0000000000,%rax
ffffffff80286319:        0 	1e 00 00 
ffffffff8028631c:      359 	48 ba b7 6d db b6 6d 	mov    $0x6db6db6db6db6db7,%rdx
ffffffff80286323:        0 	db b6 6d 
ffffffff80286326:       12 	48 c1 e1 12          	shl    $0x12,%rcx
ffffffff8028632a:        0 	49 8d 04 06          	lea    (%r14,%rax,1),%rax
ffffffff8028632e:      492 	48 c1 e9 1e          	shr    $0x1e,%rcx
ffffffff80286332:       23 	48 c1 f8 03          	sar    $0x3,%rax
ffffffff80286336:        0 	48 0f af c2          	imul   %rdx,%rax
ffffffff8028633a:     1390 	48 39 c8             	cmp    %rcx,%rax
ffffffff8028633d:        0 	75 05                	jne    ffffffff80286344 <page_check_address+0x100>
ffffffff8028633f:        0 	4d 89 2f             	mov    %r13,(%r15)
ffffffff80286342:      165 	eb 06                	jmp    ffffffff8028634a <page_check_address+0x106>
ffffffff80286344:        0 	fe 43 10             	incb   0x10(%rbx)
ffffffff80286347:    11886 	45 31 e4             	xor    %r12d,%r12d
ffffffff8028634a:    17451 	5a                   	pop    %rdx
ffffffff8028634b:    14507 	5b                   	pop    %rbx
ffffffff8028634c:       42 	4c 89 e0             	mov    %r12,%rax
ffffffff8028634f:     1736 	41 5c                	pop    %r12
ffffffff80286351:       40 	41 5d                	pop    %r13
ffffffff80286353:     1727 	41 5e                	pop    %r14
ffffffff80286355:     1420 	41 5f                	pop    %r15
ffffffff80286357:       44 	c9                   	leaveq 
ffffffff80286358:     1685 	c3                   	retq   

gcc-4.2.3, the config is at:

  http://redhat.com/~mingo/misc/config-Thu_May__8_22_23_21_CEST_2008


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 21:45                                     ` Ingo Molnar
@ 2008-05-08 22:02                                       ` Ingo Molnar
  2008-05-08 22:55                                       ` Linus Torvalds
  1 sibling, 0 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 22:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Ingo Molnar <mingo@elte.hu> wrote:

> yeah, i captured a trace of all the down()s that happen in the 
> workload, using ftrace/sched_switch + stacktrace + a tracepoint in 
> down(). Here are the semaphore activities in the trace:

i've updated the trace with a better one:

>    http://redhat.com/~mingo/misc/aim7-ftrace.txt

the first one had some idle time in it as well plus the effects of a 
2000-task Ctrl-Z - which skewed the histograms.

the new stats are more straightforward. down() callsite histogram:

     42  down <= lock_kernel <= de_put <= proc_delete_inode <
     42  down <= lock_kernel <= proc_lookup_de <= proc_lookup <
     78  down <= lock_kernel <= proc_lookup_de <= proc_lookup <
    310  down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
    332  down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
    380  down <= lock_kernel <= tty_release <= __fput <
    422  down <= lock_kernel <= chrdev_open <= __dentry_open <

hm, why is chrdev_open() called all that often?

top-5 wakeups:

      4  default_wake_function <= __wake_up_common <= complete <= migration_thread <= kthread <= child_rip <
      4  wake_up_process <= sched_exec <= do_execve <= sys_execve <= stub_execve <=  <
      8  wake_up_process <= __mutex_unlock_slowpath <= mutex_unlock <= do_filp_open <= do_sys_open <= sys_open <
     40  default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up <= sock_def_wakeup <= tcp_rcv_state_process <
     98  default_wake_function <= __wake_up_common <= __wake_up_sync <= do_notify_parent <= do_exit <= do_group_exit <

i.e. very little wakeup activity. Top 10 scheduling points:

     10  schedule <= kjournald <= kthread <= child_rip <
     12  schedule <= __down_write_nested <= __down_write <= down_write <
     29  schedule <= worker_thread <= kthread <= child_rip <
     40  schedule <= schedule_timeout <= inet_stream_connect <= sys_connect <
     59  schedule <= __cond_resched <= _cond_resched <= generic_file_buffered_write <
    111  schedule <= ksoftirqd <= kthread <= child_rip <
    119  schedule <= do_wait <= sys_wait4 <= system_call_after_swapgs <
    659  schedule <= do_exit <= do_group_exit <= sys_exit_group <
    781  schedule <= sysret_careful <=  <= 0 <
   1347  schedule <= retint_careful <=  <= 0 <

> and here's a few seconds worth of NMI driven readprofile output:

the NMI profiling results were accurate.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 21:45                                     ` Ingo Molnar
  2008-05-08 22:02                                       ` Ingo Molnar
@ 2008-05-08 22:55                                       ` Linus Torvalds
  2008-05-08 23:07                                         ` Linus Torvalds
  2008-05-08 23:16                                         ` Alan Cox
  1 sibling, 2 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08 22:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin, Alan Cox



On Thu, 8 May 2008, Ingo Molnar wrote:
>
>    2512  down <= lock_kernel <= opost <= write_chan <
>    2574  down <= lock_kernel <= write_chan <= tty_write <

Ok. tty write handling. Nasty. But not as nasty as the open/close code, 
perhaps, and maybe we'll get it fixed some day.

In fact, I thought we had fixed most of this already, but hey, I was 
clearly wrong. I assume Alan looks at it occasionally and groans. Alan?

> 
> some other interesting stats. Top wakeup sources:
> 
>   [...]
>    1301  default_wake_function <= __wake_up_common <= __wake_up <= n_tty_receive_buf <= pty_write <= write_chan <
>    2065  wake_up_state <= prepare_signal <= send_signal <= __group_send_sig_info <= group_send_sig_info <= __kill_pgrp_info <

Ok, signals being the top one, but that tty code is pretty high again.

> and here's a few seconds worth of NMI driven readprofile output:
> 
> 216021 sync_page                                3375.3281
> 391888 page_check_address                       1414.7581
> 962212 total                                      0.3039
> 
> system overhead is consistently 20% during this test.
> 
> the page_check_address() overhead is surprising - tons of rmap 
> contention? about 10% wall-clock overhead in that function alone - and 
> this is just on a dual-core box!

No, it's not rmap contention. Your profile hits are just on the actual 
calculations, and it's all data-dependent arithmetic and loads. Some cache 
misses on the page tables, clearly, but it looks like a lot of it is even 
just the plain arithmetic (the imul followed by a data-dependent 'lea' 
instruction).

Some of it is that "page_to_pfn(page)", which involves a nasty division 
(divide by sizeof(struct page)). It gets turned into that shift and 
multiply, but it's still quite expensive with big constants etc.

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 22:55                                       ` Linus Torvalds
@ 2008-05-08 23:07                                         ` Linus Torvalds
  2008-05-08 23:14                                           ` Linus Torvalds
  2008-05-08 23:16                                         ` Alan Cox
  1 sibling, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08 23:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin, Alan Cox



On Thu, 8 May 2008, Linus Torvalds wrote:
> 
> Some of it is that "page_to_pfn(page)", which involves a nasty division 
> (divide by sizeof(struct page)). It gets turned into that shift and 
> multiply, but it's still quite expensive with big constants etc.

Btw, sparse will complain about those, because the source code *looks* 
really cheap.

The normal "page_to_pfn()" looks trivial:

	((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)

which looks like a trivial subtraction and addition of a constant, but the 
subtraction is on C pointers, and basically turns into

	((unsigned long)page - (unsigned long)mem_map) / sizeof(struct page)

and because "struct page" is not some nice power-of-two in size, that 
division is rather nasty even though it's a constant size.
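
(To make the cost concrete: assuming sizeof(struct page) is 56 bytes on 
this config - which is what the 'sar $3' plus the imul by 
0x6db6db6db6db6db7 in the profile above suggest, since 56 = 8 * 7 and 
that constant is the multiplicative inverse of 7 modulo 2^64 - the exact 
division by 56 compiles to roughly this hypothetical helper:)

	/* illustration only, not real kernel code */
	static inline unsigned long page_index(unsigned long byte_diff)
	{
		/*
		 * byte_diff is an exact multiple of 56, so:
		 * byte_diff / 56 == (byte_diff >> 3) * inverse_of_7 (mod 2^64)
		 */
		return (byte_diff >> 3) * 0x6db6db6db6db6db7UL;
	}

i.e. even the "cheap" multiply-by-inverse form costs a shift, a 64-bit 
multiply and a big immediate on every conversion.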

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 23:07                                         ` Linus Torvalds
@ 2008-05-08 23:14                                           ` Linus Torvalds
  0 siblings, 0 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08 23:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin, Alan Cox



On Thu, 8 May 2008, Linus Torvalds wrote:
> 
> Btw, sparse will complain about those, because the source code *looks* 
> really cheap.

Sometimes you can fix it.

For example, this change:

	-       if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
	+       if (pte_present(*pte) && page == pfn_to_page(pte_pfn(*pte))) {

can simplify things: instead of moving from a 'struct page' to a pfn, it 
moves from a pfn to a 'struct page', and that is generally cheaper 
(multiply rather than divide by size of struct page). It's not always the 
same thing to do, but I think in this case we can. For me, the code 
generation changes:

	-       movabsq $7905747460161236407, %rdx      #, tmp111
	-       movabsq $32985348833280, %rax   #, tmp107
	-       leaq    (%r12,%rax), %rax       #, tmp106
	-       sarq    $3, %rax        #, tmp106
	-       imulq   %rdx, %rax      # tmp111, tmp106
	-       movabsq $70368744177663, %rdx   #, tmp113
	-       andq    %rdx, %rcx      # tmp113, pte$pte
	-       shrq    $12, %rcx       #, pte$pte
	-       cmpq    %rcx, %rax      # pte$pte, tmp106
	+       movabsq $70368744177663, %rax   #, tmp107
	+       andq    %rax, %rdx      # tmp107, pte$pte
	+       shrq    $12, %rdx       #, pte$pte
	+       imulq   $56, %rdx, %rax #, pte$pte, tmp109
	+       movabsq $-32985348833280, %rdx  #, tmp111
	+       addq    %rdx, %rax      # tmp111, tmp110
	+       cmpq    %rax, %r13      # tmp110, page

which isn't a *huge* deal, but it certainly looks better. One less big 
constant, and one less shift.

It's not going to make a huge difference, though. That function is just 
called too much, and it would still be entirely data-dependent all the way 
through.

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 22:55                                       ` Linus Torvalds
  2008-05-08 23:07                                         ` Linus Torvalds
@ 2008-05-08 23:16                                         ` Alan Cox
  2008-05-08 23:33                                           ` Linus Torvalds
  1 sibling, 1 reply; 140+ messages in thread
From: Alan Cox @ 2008-05-08 23:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

> In fact, I thought we had fixed most of this already, but hey, I was 
> clearly wrong. I assume Alan looks at it occasionally and groans. Alan?

I have pushed it down to the n_tty line discipline code, but not within
that. It is on the hit list, but I'm working on more pressing stuff
first (USB layer, extracting commonality to start tackling open, etc.).

I don't think fixing n_tty is now a big job if someone wants to take a
swing at it. The driver write/throttle/etc routines below the n_tty ldisc
layer are now BKL clean so it should just be the internal locking of the
buffers, window and the like to tackle.

Feel free to have a go 8)

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 23:33                                           ` Linus Torvalds
@ 2008-05-08 23:27                                             ` Alan Cox
  2008-05-09  6:50                                             ` Ingo Molnar
  2008-05-09  8:29                                             ` Andi Kleen
  2 siblings, 0 replies; 140+ messages in thread
From: Alan Cox @ 2008-05-08 23:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

>     380  down <= lock_kernel <= tty_release <= __fput <
>     422  down <= lock_kernel <= chrdev_open <= __dentry_open <
> 
> rather than the write routines. But it may be that Ingo was just profiling 
> two different sections, and it's really all of them.

tty release is probably a few months away from getting cured  - I'm
afraid it will almost certainly be the very last user of the BKL in tty
to get fixed as it depends on everything else being sanely locked.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 23:16                                         ` Alan Cox
@ 2008-05-08 23:33                                           ` Linus Torvalds
  2008-05-08 23:27                                             ` Alan Cox
                                                               ` (2 more replies)
  0 siblings, 3 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08 23:33 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin



On Fri, 9 May 2008, Alan Cox wrote:
> 
> I don't think fixing n_tty is now a big job if someone wants to take a
> swing at it. The driver write/throttle/etc routines below the n_tty ldisc
> layer are now BKL clean so it should just be the internal locking of the
> buffers, window and the like to tackle.

Well, it turns out that Ingo's fixed statistics actually put the real cost 
in fcntl/ioctl/open/release:

    310  down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
    332  down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
    380  down <= lock_kernel <= tty_release <= __fput <
    422  down <= lock_kernel <= chrdev_open <= __dentry_open <

rather than the write routines. But it may be that Ingo was just profiling 
two different sections, and it's really all of them.

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: Oi.  NFS people.  Read this.
  2008-05-07 22:10                                     ` Trond Myklebust
@ 2008-05-09  1:43                                       ` J. Bruce Fields
  0 siblings, 0 replies; 140+ messages in thread
From: J. Bruce Fields @ 2008-05-09  1:43 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Matthew Wilcox, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel

On Wed, May 07, 2008 at 03:10:27PM -0700, Trond Myklebust wrote:
> On Wed, 2008-05-07 at 14:00 -0600, Matthew Wilcox wrote:
> > On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote:
> > > On Wed, 7 May 2008, Matthew Wilcox wrote:
> > > > 
> > > > One patch I'd still like Yanmin to test is my one from yesterday which
> > > > removes the BKL from fs/locks.c.
> > > 
> > > And I'd personally rather have the network-fs people test and comment on 
> > > that one ;)
> > > 
> > > I think that patch is worth looking at regardless, but the problems with 
> > > that one aren't about performance, but about what the implications are for 
> > > the filesystems (if any)...
> > 
> > Oh, well, they don't seem interested.
> 
> Poor timing: we're all preparing for and travelling to the annual
> Connectathon interoperability testing conference which starts tomorrow.
> 
> > I can comment on some of the problems though.
> > 
> > fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c,
> > fs/nfsd/nfs4state.c all walk the i_flock list under the BKL.  That won't
> > protect them against locks.c any more.  That's probably OK for fs/nfs/*
> > since they'll be protected by their own data structures (Someone please
> > check me on that?), but it's a bad idea for lockd/nfsd which are walking
> > the lists for filesystems.
> 
> Yes. fs/nfs is just reusing the code in fs/locks.c in order to track the
> locks it holds on the server. We could alternatively have coded a
> private lock implementation, but this seemed easier.

So, assuming nfs is taking care of its own locking (I don't know if
that's right), that leaves nlm_traverse_locks() and nlm_file_inuse()
(both in fs/lockd/svcsubs.c) as the problem spots.

> > Are we going to have to export the file_lock_lock?  I'd rather not.  But
> > we need to keep nfsd/lockd from tripping over locks.c.
> > 
> > Maybe we could come up with a decent API that lockd could use?  It all
> > seems a bit complex at the moment ... maybe lockd should be keeping
> > track of the locks it owns anyway (since surely the posix deadlock
> > detection code can't work properly if it's just passing all the locks
> > through).
> 
> I'm not sure what you mean when you talk about lockd keeping track of
> the locks it owns. It has to keep those locks on inode->i_flock in order
> to make them visible to the host filesystem...
> 
> All lockd really needs, is the ability to find a lock it owns, and then
> obtain a copy.

That sounds right.

--b.

> As for the nfs client, I suspect we can make do with
> something similar...



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 23:33                                           ` Linus Torvalds
  2008-05-08 23:27                                             ` Alan Cox
@ 2008-05-09  6:50                                             ` Ingo Molnar
  2008-05-09  8:29                                             ` Andi Kleen
  2 siblings, 0 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-09  6:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > I don't think fixing n_tty is now a big job if someone wants to take 
> > a swing at it. The driver write/throttle/etc routines below the 
> > n_tty ldisc layer are now BKL clean so it should just be the 
> > internal locking of the buffers, window and the like to tackle.
> 
> Well, it turns out that Ingo's fixed statistics actually put the real 
> cost in fcntl/ioctl/open/release:
> 
>     310  down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
>     332  down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
>     380  down <= lock_kernel <= tty_release <= __fput <
>     422  down <= lock_kernel <= chrdev_open <= __dentry_open <
> 
> rather than the write routines. But it may be that Ingo was just 
> profiling two different sections, and it's really all of them.

the first trace had general desktop load mixed into it as well - so 
while it's not interesting to AIM7, the BKL does matter in those 
situations, and i'd not be surprised if it were responsible for certain 
categories of desktop lag.

The second trace was the correct 'pure' AIM7 workload, which produces 
very little tty output. It is quite a stable workload, and the trace i 
uploaded is representative of the workload as a whole. AIM7 runs for 
several minutes, so there's no significant ramp-up/ramp-down interaction 
either.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 23:33                                           ` Linus Torvalds
  2008-05-08 23:27                                             ` Alan Cox
  2008-05-09  6:50                                             ` Ingo Molnar
@ 2008-05-09  8:29                                             ` Andi Kleen
  2 siblings, 0 replies; 140+ messages in thread
From: Andi Kleen @ 2008-05-09  8:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Ingo Molnar, Zhang, Yanmin, Matthew Wilcox, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> Well, it turns out that Ingo's fixed statistics actually put the real cost 
> in fcntl/ioctl/open/release:
>
>     310  down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <

That must be ->fasync? If it were file locks, the lock_kernel would not
be inlined into sys_fcntl. Or is that a truncated backtrace?

-Andi (wondering if he should plug ->fasync_unlocked again ...)

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-08 15:10                               ` [git pull] scheduler fixes Ingo Molnar
  2008-05-08 15:33                                 ` Adrian Bunk
@ 2008-05-11 11:03                                 ` Matthew Wilcox
  2008-05-11 11:14                                   ` Matthew Wilcox
                                                     ` (3 more replies)
  1 sibling, 4 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 11:03 UTC (permalink / raw)
  To: Ingo Molnar, Sven Wegener
  Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Thu, May 08, 2008 at 05:10:28PM +0200, Ingo Molnar wrote:
> @@ -258,7 +256,5 @@ static noinline void __sched __up(struct semaphore *sem)
>  {
>  	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
>  						struct semaphore_waiter, list);
> -	list_del(&waiter->list);
> -	waiter->up = 1;
>  	wake_up_process(waiter->task);
>  }

This might be the problem that causes the missing wakeups.  If you have a
semaphore with n=2, and four processes calling down(), tasks A and B
acquire the semaphore and tasks C and D go to sleep.  Task A calls up()
and wakes up C.  Then task B calls up() and doesn't wake up anyone
because C hasn't run yet.  I think we need another wakeup when task C
finishes in __down_common, like this (on top of your patch):

diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5e41217..e520ad4 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -229,6 +229,11 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
 	}
 
 	list_del(&waiter.list);
+
+	/* It's possible we need to wake up the next task on the list too */
+	if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list))
+		__up(sem);
+
 	return ret;
 }
 
Sven, can you try this with your workload?  I suspect this might be it
because XFS does use semaphores with n>1.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-06 18:07             ` Andrew Morton
@ 2008-05-11 11:11               ` Matthew Wilcox
  0 siblings, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 11:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, bfields, yanmin_zhang, linux-kernel, viro, torvalds,
	linux-fsdevel

On Tue, May 06, 2008 at 11:07:52AM -0700, Andrew Morton wrote:
> Yeah, the early bootup code.  The kernel does accidental lock_kernel()s in
> various places and if that re-enables interrupts then powerpc goeth crunch.
> 
> Matthew, that seemingly-unneeded irqsave in lib/semaphore.c is a prime site
> for /* one of these things */, no?

I was just reviewing the code and I came across one of these:

/*
 * Some notes on the implementation:
 *
 * The spinlock controls access to the other members of the semaphore.
 * down_trylock() and up() can be called from interrupt context, so we
 * have to disable interrupts when taking the lock.  It turns out various
 * parts of the kernel expect to be able to use down() on a semaphore in
 * interrupt context when they know it will succeed, so we have to use
 * irqsave variants for down(), down_interruptible() and down_killable()
 * too.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 11:03                                 ` Matthew Wilcox
@ 2008-05-11 11:14                                   ` Matthew Wilcox
  2008-05-11 11:48                                   ` Matthew Wilcox
                                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 11:14 UTC (permalink / raw)
  To: Ingo Molnar, Sven Wegener
  Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, May 11, 2008 at 05:03:06AM -0600, Matthew Wilcox wrote:
> This might be the problem that causes the missing wakeups.  If you have a
> semaphore with n=2, and four processes calling down(), tasks A and B
> acquire the semaphore and tasks C and D go to sleep.  Task A calls up()
> and wakes up C.  Then task B calls up() and doesn't wake up anyone
> because C hasn't run yet.  I think we need another wakeup when task C

Er, I mis-wrote there.

Task A calls up() and wakes up C.  Then task B calls up() and wakes up C
again because C hasn't removed itself from the list yet.  D never
receives a wakeup.  The solution is for C to pass a wakeup along to the
next in line.  (The solution remains the same).
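
To spell out the interleaving with the names above (semaphore 
initialised to 2, so two slots, under Ingo's wake-without-dequeue code):

	C, D:  down()  -> count is 0, both sleep; wait list is C, D
	A:     up()    -> count = 1, wakes C (C stays on the wait list)
	B:     up()    -> count = 2, wakes C again (still the first entry)
	C:     runs, sees count > 0, removes itself, takes one slot
	D:     stays asleep even though a slot is free

With the extra check in __down_common, C notices the leftover count and 
the non-empty wait list on its way out and passes the wakeup on to D.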

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 11:03                                 ` Matthew Wilcox
  2008-05-11 11:14                                   ` Matthew Wilcox
@ 2008-05-11 11:48                                   ` Matthew Wilcox
  2008-05-11 12:50                                     ` Ingo Molnar
  2008-05-11 13:01                                   ` Ingo Molnar
  2008-05-11 14:10                                   ` Sven Wegener
  3 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 11:48 UTC (permalink / raw)
  To: Ingo Molnar, Sven Wegener
  Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, May 11, 2008 at 05:03:06AM -0600, Matthew Wilcox wrote:
> This might be the problem that causes the missing wakeups.  If you have a
> semaphore with n=2, and four processes calling down(), tasks A and B
> acquire the semaphore and tasks C and D go to sleep.  Task A calls up()
[...]
> Sven, can you try this with your workload?  I suspect this might be it
> because XFS does use semaphores with n>1.

This is exactly it.  Or rather, it's even simpler.  Three tasks are 
involved: A and B call xlog_state_get_iclog_space() and end up calling 
psema(&log->l_flushsema) [down() by any other name -- the semaphore is
initialised to 0; it's really a completion].  Then task C calls
xlog_state_do_callback() which does:

        while (flushcnt--)
                vsema(&log->l_flushsema);

[vsema is AKA up()]

It assumes this wakes up both A and B ... but it won't with Ingo's code;
it'll wake up A twice.

So I deem my fix "proven by thought experiment".  I haven't tried
booting it or anything.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 11:48                                   ` Matthew Wilcox
@ 2008-05-11 12:50                                     ` Ingo Molnar
  2008-05-11 12:52                                       ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-11 12:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Matthew Wilcox <matthew@wil.cx> wrote:

> So I deem my fix "proven by thought experiment".  I haven't tried 
> booting it or anything.

i actually have two fixes, made earlier today. The 'fix3' one has been 
confirmed by Sven to fix the regression - but i think we need the 'fix 
#2' one below as well to make it complete.

	Ingo

-------------------->
Subject: semaphore: fix3
From: Ingo Molnar <mingo@elte.hu>
Date: Sun May 11 09:51:07 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -258,5 +258,12 @@ static noinline void __sched __up(struct
 {
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 						struct semaphore_waiter, list);
+
+	/*
+	 * Rotate sleepers - to make sure all of them get woken in case
+	 * of parallel up()s:
+	 */
+	list_move_tail(&waiter->list, &sem->wait_list);
+
 	wake_up_process(waiter->task);
 }

---------------->
Subject: semaphore: fix #2
From: Ingo Molnar <mingo@elte.hu>
Date: Thu May 08 11:53:48 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -194,6 +194,13 @@ struct semaphore_waiter {
 	struct task_struct *task;
 };
 
+static noinline void __sched __up(struct semaphore *sem)
+{
+	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
+						struct semaphore_waiter, list);
+	wake_up_process(waiter->task);
+}
+
 /*
  * Because this function is inlined, the 'state' parameter will be
  * constant, and thus optimised away by the compiler.  Likewise the
@@ -231,6 +238,9 @@ static inline int __sched __down_common(
 	}
 
 	list_del(&waiter.list);
+	if (unlikely(!list_empty(&sem->wait_list)) && sem->count)
+		__up(sem);
+
 	return ret;
 }
 
@@ -254,9 +264,3 @@ static noinline int __sched __down_timeo
 	return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies);
 }
 
-static noinline void __sched __up(struct semaphore *sem)
-{
-	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
-						struct semaphore_waiter, list);
-	wake_up_process(waiter->task);
-}

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 12:50                                     ` Ingo Molnar
@ 2008-05-11 12:52                                       ` Ingo Molnar
  2008-05-11 13:02                                         ` Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-11 12:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Ingo Molnar <mingo@elte.hu> wrote:

> > So I deem my fix "proven by thought experiment".  I haven't tried 
> > booting it or anything.
> 
> i actually have two fixes, made earlier today. The 'fix3' one has been 
> confirmed by Sven to fix the regression - but i think we need the 'fix
> #2' one below as well to make it complete.

i just combined them into a single fix, see below.

	Ingo

--------------------------->
Subject: semaphore: fix #3
From: Ingo Molnar <mingo@elte.hu>
Date: Sun May 11 09:51:07 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c |   23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -194,6 +194,13 @@ struct semaphore_waiter {
 	struct task_struct *task;
 };
 
+static noinline void __sched __up(struct semaphore *sem)
+{
+	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
+						struct semaphore_waiter, list);
+	wake_up_process(waiter->task);
+}
+
 /*
  * Because this function is inlined, the 'state' parameter will be
  * constant, and thus optimised away by the compiler.  Likewise the
@@ -231,6 +238,9 @@ static inline int __sched __down_common(
 	}
 
 	list_del(&waiter.list);
+	if (unlikely(!list_empty(&sem->wait_list)) && sem->count)
+		__up(sem);
+
 	return ret;
 }
 
@@ -254,9 +264,10 @@ static noinline int __sched __down_timeo
 	return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies);
 }
 
-static noinline void __sched __up(struct semaphore *sem)
-{
-	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
-						struct semaphore_waiter, list);
-	wake_up_process(waiter->task);
-}
+
+	/*
+	 * Rotate sleepers - to make sure all of them get woken in case
+	 * of parallel up()s:
+	 */
+	list_move_tail(&waiter->list, &sem->wait_list);
+

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 11:03                                 ` Matthew Wilcox
  2008-05-11 11:14                                   ` Matthew Wilcox
  2008-05-11 11:48                                   ` Matthew Wilcox
@ 2008-05-11 13:01                                   ` Ingo Molnar
  2008-05-11 13:06                                     ` Matthew Wilcox
  2008-05-11 14:10                                   ` Sven Wegener
  3 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-11 13:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Matthew Wilcox <matthew@wil.cx> wrote:

> +	/* It's possible we need to wake up the next task on the list too */
> +	if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list))
> +		__up(sem);

this needs to check for ret != 0 as well, otherwise we can be woken but 
a timeout can also trigger => we lose a wakeup. I.e. like the patch 
below. Hm?

	Ingo

----------------------------->
Subject: semaphore: fix #3
From: Ingo Molnar <mingo@elte.hu>
Date: Sun May 11 09:51:07 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -194,6 +194,13 @@ struct semaphore_waiter {
 	struct task_struct *task;
 };
 
+static noinline void __sched __up(struct semaphore *sem)
+{
+	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
+						struct semaphore_waiter, list);
+	wake_up_process(waiter->task);
+}
+
 /*
  * Because this function is inlined, the 'state' parameter will be
  * constant, and thus optimised away by the compiler.  Likewise the
@@ -231,6 +238,10 @@ static inline int __sched __down_common(
 	}
 
 	list_del(&waiter.list);
+	if (unlikely(!list_empty(&sem->wait_list)) &&
+					((sem->count > 1) || ret))
+		__up(sem);
+
 	return ret;
 }
 
@@ -254,9 +265,10 @@ static noinline int __sched __down_timeo
 	return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies);
 }
 
-static noinline void __sched __up(struct semaphore *sem)
-{
-	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
-						struct semaphore_waiter, list);
-	wake_up_process(waiter->task);
-}
+
+	/*
+	 * Rotate sleepers - to make sure all of them get woken in case
+	 * of parallel up()s:
+	 */
+	list_move_tail(&waiter->list, &sem->wait_list);
+

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 12:52                                       ` Ingo Molnar
@ 2008-05-11 13:02                                         ` Matthew Wilcox
  2008-05-11 13:26                                           ` Matthew Wilcox
  2008-05-11 13:54                                           ` Ingo Molnar
  0 siblings, 2 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 13:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, May 11, 2008 at 02:52:16PM +0200, Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > So I deem my fix "proven by thought experiment".  I haven't tried 
> > > booting it or anything.
> > 
> > i actually have two fixes, made earlier today. The 'fix3' one has been 
> > confirmed by Sven to fix the regression - but i think we need the 'fix
> > #2' one below as well to make it complete.
> 
> i just combined them into a single fix, see below.

That's mangled ... why did you move __up around?

>  	list_del(&waiter.list);
> +	if (unlikely(!list_empty(&sem->wait_list)) && sem->count)
> +		__up(sem);

That's an unnecessary wakeup compared to my patch.

>  	return ret;
>  }
>  
> @@ -254,9 +264,10 @@ static noinline int __sched __down_timeo
>  	return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies);
>  }
>  
> -static noinline void __sched __up(struct semaphore *sem)
> -{
> -	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
> -						struct semaphore_waiter, list);
> -	wake_up_process(waiter->task);
> -}
> +
> +	/*
> +	 * Rotate sleepers - to make sure all of them get woken in case
> +	 * of parallel up()s:
> +	 */
> +	list_move_tail(&waiter->list, &sem->wait_list);

Seems like extra cache line dirtying for no real gain over my solution.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 13:01                                   ` Ingo Molnar
@ 2008-05-11 13:06                                     ` Matthew Wilcox
  2008-05-11 13:45                                       ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 13:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, May 11, 2008 at 03:01:27PM +0200, Ingo Molnar wrote:
> * Matthew Wilcox <matthew@wil.cx> wrote:
> 
> > +	/* It's possible we need to wake up the next task on the list too */
> > +	if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list))
> > +		__up(sem);
> 
> this needs to check for ret != 0 as well, otherwise we can be woken but 
> a timeout can also trigger => we lose a wakeup. I.e. like the patch 
> below. Hm?

Still mangled ... and I don't see how we lose a wakeup.  We test for
having the semaphore before we check for having been interrupted, and we
hold the lock the whole time.

IOW, what I think you're checking for is:

task A			task B
			if sem->count >0
				break;
sem->count++
wake_up_process(B)
			if (state == TASK_INTERRUPTIBLE && signal_pending(task))
				break;

which can't happen because of sem->lock.

> 	Ingo
> 
> ----------------------------->
> Subject: semaphore: fix #3
> From: Ingo Molnar <mingo@elte.hu>
> Date: Sun May 11 09:51:07 CEST 2008
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  kernel/semaphore.c |   24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> Index: linux/kernel/semaphore.c
> ===================================================================
> --- linux.orig/kernel/semaphore.c
> +++ linux/kernel/semaphore.c
> @@ -194,6 +194,13 @@ struct semaphore_waiter {
>  	struct task_struct *task;
>  };
>  
> +static noinline void __sched __up(struct semaphore *sem)
> +{
> +	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
> +						struct semaphore_waiter, list);
> +	wake_up_process(waiter->task);
> +}
> +
>  /*
>   * Because this function is inlined, the 'state' parameter will be
>   * constant, and thus optimised away by the compiler.  Likewise the
> @@ -231,6 +238,10 @@ static inline int __sched __down_common(
>  	}
>  
>  	list_del(&waiter.list);
> +	if (unlikely(!list_empty(&sem->wait_list)) &&
> +					((sem->count > 1) || ret))
> +		__up(sem);
> +
>  	return ret;
>  }
>  
> @@ -254,9 +265,10 @@ static noinline int __sched __down_timeo
>  	return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies);
>  }
>  
> -static noinline void __sched __up(struct semaphore *sem)
> -{
> -	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
> -						struct semaphore_waiter, list);
> -	wake_up_process(waiter->task);
> -}
> +
> +	/*
> +	 * Rotate sleepers - to make sure all of them get woken in case
> +	 * of parallel up()s:
> +	 */
> +	list_move_tail(&waiter->list, &sem->wait_list);
> +

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 13:02                                         ` Matthew Wilcox
@ 2008-05-11 13:26                                           ` Matthew Wilcox
  2008-05-11 14:00                                             ` Ingo Molnar
  2008-05-11 13:54                                           ` Ingo Molnar
  1 sibling, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 13:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, May 11, 2008 at 07:02:26AM -0600, Matthew Wilcox wrote:
> > +	list_move_tail(&waiter->list, &sem->wait_list);
> 
> Seems like extra cache line dirtying for no real gain over my solution.

Actually, let me just go into this a little further.

In principle, you'd think that we'd want to wake up all the tasks
possible as soon as possible.  In practice, Dave Chinner has said that
the l_flushsema introduces a thundering herd (a few hundred tasks can
build up behind it on systems such as Columbia apparently) that then
run into a bottleneck as soon as they're unleashed.

Current XFS CVS has a fix from myself and Christoph that gets rid of the
l_flushsema and replaces it with a staggered wakeup of each task that's
waiting as the previously woken task clears the critical section.

Obviously, generic up() can't possibly do as well, but by staggering
the release of tasks from __down_common(), we mitigate the herd somewhat.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 13:06                                     ` Matthew Wilcox
@ 2008-05-11 13:45                                       ` Ingo Molnar
  0 siblings, 0 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-11 13:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Matthew Wilcox <matthew@wil.cx> wrote:

> IOW, what I think you're checking for is:
> 
> task A			task B
> 			if sem->count >0
> 				break;
> sem->count++
> wake_up_process(B)
> 			if (state == TASK_INTERRUPTIBLE && signal_pending(task))
> 				break;
> 
> which can't happen because of sem->lock.

ok, agreed, that race cannot happen.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 13:02                                         ` Matthew Wilcox
  2008-05-11 13:26                                           ` Matthew Wilcox
@ 2008-05-11 13:54                                           ` Ingo Molnar
  2008-05-11 14:22                                             ` Matthew Wilcox
  1 sibling, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-11 13:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Matthew Wilcox <matthew@wil.cx> wrote:

> > +	/*
> > +	 * Rotate sleepers - to make sure all of them get woken in case
> > +	 * of parallel up()s:
> > +	 */
> > +	list_move_tail(&waiter->list, &sem->wait_list);
> 
> Seems like extra cache line dirtying for no real gain over my 
> solution.

the gain is rather obvious: two parallel up()s (or just up()s which come 
close enough after each other) will wake up two tasks in parallel. With 
your patch, the first guy wakes up and then it wakes up the second guy. 
I.e. your patch serializes the wakeup chain, mine keeps it parallel.

the cache line dirtying is rather secondary to any solution - the first 
goal for any locking primitive is to get scheduling precise: to not wake 
up more tasks than optimal and to not wake up less tasks than optimal.

i.e. can you see any conceptual hole in the patch below?

	Ingo

---
 kernel/semaphore.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -258,5 +258,11 @@ static noinline void __sched __up(struct
 {
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 						struct semaphore_waiter, list);
+	/*
+	 * Rotate sleepers - to make sure all of them get woken in case
+	 * of parallel up()s:
+	 */
+	list_move_tail(&waiter->list, &sem->wait_list);
+
 	wake_up_process(waiter->task);
 }

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 13:26                                           ` Matthew Wilcox
@ 2008-05-11 14:00                                             ` Ingo Molnar
  2008-05-11 14:18                                               ` Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-11 14:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Matthew Wilcox <matthew@wil.cx> wrote:

> Current XFS CVS has a fix from myself and Christoph that gets rid of 
> the l_flushsema and replaces it with a staggered wakeup of each task 
> that's waiting as the previously woken task clears the critical 
> section.

the solution is to reduce semaphore usage by converting them to mutexes. 
Is anyone working on removing legacy semaphore use from XFS?

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 11:03                                 ` Matthew Wilcox
                                                     ` (2 preceding siblings ...)
  2008-05-11 13:01                                   ` Ingo Molnar
@ 2008-05-11 14:10                                   ` Sven Wegener
  3 siblings, 0 replies; 140+ messages in thread
From: Sven Wegener @ 2008-05-11 14:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ingo Molnar, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, 11 May 2008, Matthew Wilcox wrote:

> On Thu, May 08, 2008 at 05:10:28PM +0200, Ingo Molnar wrote:
>> @@ -258,7 +256,5 @@ static noinline void __sched __up(struct semaphore *sem)
>>  {
>>  	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
>>  						struct semaphore_waiter, list);
>> -	list_del(&waiter->list);
>> -	waiter->up = 1;
>>  	wake_up_process(waiter->task);
>>  }
>
> This might be the problem that causes the missing wakeups.  If you have a
> semaphore with n=2, and four processes calling down(), tasks A and B
> acquire the semaphore and tasks C and D go to sleep.  Task A calls up()
> and wakes up C.  Then task B calls up() and doesn't wake up anyone
> because C hasn't run yet.  I think we need another wakeup when task C
> finishes in __down_common, like this (on top of your patch):
>
> diff --git a/kernel/semaphore.c b/kernel/semaphore.c
> index 5e41217..e520ad4 100644
> --- a/kernel/semaphore.c
> +++ b/kernel/semaphore.c
> @@ -229,6 +229,11 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
> 	}
>
> 	list_del(&waiter.list);
> +
> +	/* It's possible we need to wake up the next task on the list too */
> +	if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list))
> +		__up(sem);
> +
> 	return ret;
> }
>
> Sven, can you try this with your workload?  I suspect this might be it
> because XFS does use semaphores with n>1.

This one fixes the regression too, after applying it on top of bf726e.

Sven

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 14:00                                             ` Ingo Molnar
@ 2008-05-11 14:18                                               ` Matthew Wilcox
  2008-05-11 14:42                                                 ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 14:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, May 11, 2008 at 04:00:17PM +0200, Ingo Molnar wrote:
> * Matthew Wilcox <matthew@wil.cx> wrote:
> 
> > Current XFS CVS has a fix from myself and Christoph that gets rid of 
> > the l_flushsema and replaces it with a staggered wakeup of each task 
> > that's waiting as the previously woken task clears the critical 
> > section.
> 
> the solution is to reduce semaphore usage by converting them to mutexes. 
> Is anyone working on removing legacy semaphore use from XFS?

This race is completely irrelevant to converting semaphores to mutexes.
It can only occur for semaphores which /can't/ be converted to mutexes.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 13:54                                           ` Ingo Molnar
@ 2008-05-11 14:22                                             ` Matthew Wilcox
  2008-05-11 14:32                                               ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 14:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, May 11, 2008 at 03:54:14PM +0200, Ingo Molnar wrote:
> the gain is rather obvious: two parallel up()s (or just up()s which come 
> close enough after each other) will wake up two tasks in parallel. With 
> your patch, the first guy wakes up and then it wakes up the second guy. 
> I.e. your patch serializes the wakeup chain, mine keeps it parallel.

Yup.  I explained why that's actually beneficial in an earlier email.

> the cache line dirtying is rather secondary to any solution - the first 
> goal for any locking primitive is to get scheduling precise: to not wake 
> up more tasks than optimal and to not wake up less tasks than optimal.

That's a laudable goal, but ultimately it's secondary to performance (or
this thread wouldn't exist).

> i.e. can you see any conceptual hole in the patch below?

No conceptual holes, just a performance one.

Either we want your patch below or mine; definitely not both.

diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5e41217..e520ad4 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -229,6 +229,11 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
 	}
 
 	list_del(&waiter.list);
+
+	/* It's possible we need to wake up the next task on the list too */
+	if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list))
+		__up(sem);
+
 	return ret;
 }
 

> 	Ingo
> 
> ---
>  kernel/semaphore.c |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> Index: linux/kernel/semaphore.c
> ===================================================================
> --- linux.orig/kernel/semaphore.c
> +++ linux/kernel/semaphore.c
> @@ -258,5 +258,11 @@ static noinline void __sched __up(struct
>  {
>  	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
>  						struct semaphore_waiter, list);
> +	/*
> +	 * Rotate sleepers - to make sure all of them get woken in case
> +	 * of parallel up()s:
> +	 */
> +	list_move_tail(&waiter->list, &sem->wait_list);
> +
>  	wake_up_process(waiter->task);
>  }

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 14:22                                             ` Matthew Wilcox
@ 2008-05-11 14:32                                               ` Ingo Molnar
  2008-05-11 14:46                                                 ` Matthew Wilcox
  2008-05-11 16:47                                                 ` Linus Torvalds
  0 siblings, 2 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-11 14:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin


* Matthew Wilcox <matthew@wil.cx> wrote:

> > the gain is rather obvious: two parallel up()s (or just up()s which 
> > come close enough after each other) will wake up two tasks in 
> > parallel. With your patch, the first guy wakes up and then it wakes 
> > up the second guy. I.e. your patch serializes the wakeup chain, mine 
> > keeps it parallel.
> 
> Yup.  I explained why that's actually beneficial in an earlier email.

but the problem is that by serializing the wakeup chains naively you 
introduced a more than 50% AIM7 performance regression.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 14:18                                               ` Matthew Wilcox
@ 2008-05-11 14:42                                                 ` Ingo Molnar
  2008-05-11 14:48                                                   ` Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-11 14:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra


* Matthew Wilcox <matthew@wil.cx> wrote:

> > > Current XFS CVS has a fix from myself and Christoph that gets rid 
> > > of the l_flushsema and replaces it with a staggered wakeup of each 
> > > task that's waiting as the previously woken task clears the 
> > > critical section.
> > 
> > the solution is to reduce semaphore usage by converting them to 
> > mutexes. Is anyone working on removing legacy semaphore use from 
> > XFS?
> 
> This race is completely irrelevant to converting semaphores to 
> mutexes. [...]

i was not talking about the race. I was just reacting on your comments 
about thundering herds and staggered wakeups - which is a performance 
detail. Semaphores should not regress AIM7 by 50% but otherwise they are 
legacy code and their use should be reduced monotonically, so i was 
asking why anyone still cares about tuning semaphore details in XFS 
instead of just working on removing semaphore use from them.

> [...] It can only occur for semaphores which /can't/ be converted to 
> mutexes.

exactly what usecase is that? Perhaps it could be converted to an atomic 
counter + the wait_event() APIs.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 14:32                                               ` Ingo Molnar
@ 2008-05-11 14:46                                                 ` Matthew Wilcox
  2008-05-11 16:47                                                 ` Linus Torvalds
  1 sibling, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 14:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, May 11, 2008 at 04:32:27PM +0200, Ingo Molnar wrote:
> 
> * Matthew Wilcox <matthew@wil.cx> wrote:
> 
> > > the gain is rather obvious: two parallel up()s (or just up()s which 
> > > come close enough after each other) will wake up two tasks in 
> > > parallel. With your patch, the first guy wakes up and then it wakes 
> > > up the second guy. I.e. your patch serializes the wakeup chain, mine 
> > > keeps it parallel.
> > 
> > Yup.  I explained why that's actually beneficial in an earlier email.
> 
> but the problem is that by serializing the wakeup chains naively you 
> introduced a more than 50% AIM7 performance regression.

That's a different issue.  The AIM7 regression is to do with whether a
task that is currently running and hits a semaphore that has no current
holder but someone else waiting should be allowed to jump the queue.
No argument there; performance trumps theoretical fairness.

This issue is whether multiple sleepers should be woken up all-at-once
or one-at-a-time.  Here, you seem to be arguing for theoretical fairness
to trump performance.

(Let's be quite clear; this issue affects *only* multiple
sleepers and multiple wakes given to those sleepers.  ie
semaphores-being-used-as-completions and true counting semaphores).
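
For reference, the semaphore-as-completion shape (the l_flushsema
pattern from earlier in this thread) is roughly:

	struct semaphore done;

	sema_init(&done, 0);		/* count 0: down() always sleeps */

	/* waiter(s): */ down(&done);
	/* waker:     */ up(&done);	/* one up() per waiter to release */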

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 14:42                                                 ` Ingo Molnar
@ 2008-05-11 14:48                                                   ` Matthew Wilcox
  2008-05-11 15:19                                                     ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 14:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra

On Sun, May 11, 2008 at 04:42:03PM +0200, Ingo Molnar wrote:
> 
> * Matthew Wilcox <matthew@wil.cx> wrote:
> 
> > > > Current XFS CVS has a fix from myself and Christoph that gets rid 
> > > > of the l_flushsema and replaces it with a staggered wakeup of each 
> > > > task that's waiting as the previously woken task clears the 
> > > > critical section.
>
> i was not talking about the race. I was just reacting on your comments 
> about thundering herds and staggered wakeups - which is a performance 
> detail. Semaphores should not regress AIM7 by 50% but otherwise they are 
> legacy code and their use should be reduced monotonically, so i was 
> asking why anyone still cares about tuning semaphore details in XFS 
> instead of just working on removing semaphore use from them.

That's what we did.  l_flushsema is gone.  It actually got replaced with
a condition variable, but it's equivalent to a wait_event().

> exactly what usecase is that? Perhaps it could be converted to an atomic 
> counter + the wait_event() APIs.

Effectively, it's a completion.  It just works better with staggered
wakeups than it does with the naive completion.
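
Roughly, it maps onto the wait_event() shape below (illustrative only;
the predicate name is made up, the real XFS change used a condition
variable):

	wait_queue_head_t flush_wait;		/* replaces l_flushsema */

	/* waiter side (was psema(&log->l_flushsema)): */
	wait_event(flush_wait, xlog_flush_done(log));	/* hypothetical predicate */

	/* waker side (was the flushcnt loop of vsema()s): */
	wake_up_all(&flush_wait);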

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 14:48                                                   ` Matthew Wilcox
@ 2008-05-11 15:19                                                     ` Ingo Molnar
  2008-05-11 15:29                                                       ` Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-11 15:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra


* Matthew Wilcox <matthew@wil.cx> wrote:

> > exactly what usecase is that? Perhaps it could be converted to an 
> > atomic counter + the wait_event() APIs.
> 
> Effectively, it's a completion.  It just works better with staggered 
> wakeups than it does with the naive completion.

So why not transform it to real completions instead? And if our current 
'struct completion' abstraction is insufficient for whatever reason, why 
not extend that instead?

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 15:19                                                     ` Ingo Molnar
@ 2008-05-11 15:29                                                       ` Matthew Wilcox
  2008-05-13 14:11                                                         ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 15:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra

On Sun, May 11, 2008 at 05:19:09PM +0200, Ingo Molnar wrote:
> * Matthew Wilcox <matthew@wil.cx> wrote:
> > > exactly what usecase is that? Perhaps it could be converted to an 
> > > atomic counter + the wait_event() APIs.
> > 
> > Effectively, it's a completion.  It just works better with staggered 
> > wakeups than it does with the naive completion.
> 
> So why not transform it to real completions instead? And if our current 
> 'struct completion' abstraction is insufficient for whatever reason, why 
> not extend that instead?

My point is that for the only user of counting semaphores and/or
semaphores-abused-as-completions that has so far hit this race, the
serialised wake-up performs better.  You have not pointed at a scenario
that _shows_ a parallel wake-up to perform better.  Some hand-waving
and talking about lofty principles, yes.  But no actual data.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 14:32                                               ` Ingo Molnar
  2008-05-11 14:46                                                 ` Matthew Wilcox
@ 2008-05-11 16:47                                                 ` Linus Torvalds
  1 sibling, 0 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-11 16:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matthew Wilcox, Sven Wegener, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin



On Sun, 11 May 2008, Ingo Molnar wrote:
> 
> but the problem is that by serializing the wakeup chains naively you 
> introduced a more than 50% AIM7 performance regression.

No, the problem is that the BKL shouldn't be a semaphore at all. 
Performance fixed.

After that, it's *purely* a correctness issue, and then we might as well 
be fair and not allow the stealing of semaphores from under waiters at 
ALL. Which is what Matthew's original code did.

In other words, current -git is fine, and was already confirmed by Sven 
to fix the bug (before *any* of your patches were), and was earlier 
confirmed to fix the AIM7 performance regression (better than *any* of 
your patches were).  

So I fixed the problems last night already. Stop wasting everybody's time.

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-11 15:29                                                       ` Matthew Wilcox
@ 2008-05-13 14:11                                                         ` Ingo Molnar
  2008-05-13 14:21                                                           ` Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-13 14:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra


* Matthew Wilcox <matthew@wil.cx> wrote:

> > So why not transform it to real completions instead? And if our 
> > current 'struct completion' abstraction is insufficient for whatever 
> > reason, why not extend that instead?
> 
> My point is that for the only user of counting semaphores and/or 
> semaphores-abused-as-completions that has so far hit this race, the 
> serialised wake-up performs better.  You have not pointed at a 
> scenario that _shows_ a parallel wake-up to perform better. [...]

a 50% AIM7 slowdown maybe? With the BKL being a spinlock again it doesn't 
matter much in practice though.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-13 14:11                                                         ` Ingo Molnar
@ 2008-05-13 14:21                                                           ` Matthew Wilcox
  2008-05-13 14:42                                                             ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-13 14:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra

On Tue, May 13, 2008 at 04:11:29PM +0200, Ingo Molnar wrote:
> 
> * Matthew Wilcox <matthew@wil.cx> wrote:
> 
> > > So why not transform it to real completions instead? And if our 
> > > current 'struct completion' abstraction is insufficient for whatever 
> > > reason, why not extend that instead?
> > 
> > My point is that for the only user of counting semaphores and/or 
> > semaphores-abused-as-completions that has so far hit this race, the 
> > serialised wake-up performs better.  You have not pointed at a 
> > scenario that _shows_ a parallel wake-up to perform better. [...]
> 
> a 50% AIM7 slowdown maybe? With the BKL being a spinlock again it doesn't 
> matter much in practice though.

You're not understanding me.  This is completely inapplicable to the BKL
because only one task can be in wakeup at a time (due to it having a
maximum value of 1).  There's no way to hit this race with the BKL.
The only kind of semaphore that can hit this race is the kind that can
have more than one wakeup in progress at a time -- ie one which can have
a value >1.  Like completions and real counting semaphores.

So the only thing worth talking about (and indeed, it's now entirely
moot) is what's the best way to solve this problem /for this kind of
semaphore/.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-13 14:21                                                           ` Matthew Wilcox
@ 2008-05-13 14:42                                                             ` Ingo Molnar
  2008-05-13 15:28                                                               ` Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-13 14:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra


* Matthew Wilcox <matthew@wil.cx> wrote:

> > a 50% AIM7 slowdown maybe? With the BKL being a spinlock again it 
> > doesnt matter much in practice though.
> 
> You're not understanding me.  This is completely inapplicable to the 
> BKL because only one task can be in wakeup at a time (due to it having 
> a maximum value of 1).  There's no way to hit this race with the BKL. 
> The only kind of semaphore that can hit this race is the kind that can 
> have more than one wakeup in progress at a time -- ie one which can 
> have a value >1.  Like completions and real counting semaphores.

yes, but even for parallel wakeups for completions it's good in general 
to keep more tasks in flight than to keep less tasks in flight.

Perhaps the code could throttle them to nr_cpus, but otherwise, as the 
BKL example has shown (in another context), we do much better if we 
overload the scheduler (in which case it can and does batch 
intelligently) than if we try to second-guess it and under-load it and 
create lots of scheduling events.

i'd agree with you that there are no numbers available pro or contra, so you 
are right that my 50% point does not apply to your argument.

> So the only thing worth talking about (and indeed, it's now entirely 
> moot) is what's the best way to solve this problem /for this kind of 
> semaphore/.

it's not really moot in terms of improving the completions code i 
suspect? For XFS i guess performance matters.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-13 14:42                                                             ` Ingo Molnar
@ 2008-05-13 15:28                                                               ` Matthew Wilcox
  2008-05-13 17:13                                                                 ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-13 15:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra

On Tue, May 13, 2008 at 04:42:07PM +0200, Ingo Molnar wrote:
> yes, but even for parallel wakeups for completions it's good in general 
> to keep more tasks in flight than to keep less tasks in flight.

That might be the case for some users, but it isn't the case for XFS.
The first thing that each task does is grab a spinlock, so if you put as
much in flight as early as possible, you end up with horrible contention
on that spinlock.  I have no idea whether this is the common case for
multi-valued semaphores or not, it's just the only one I have data for.

> > So the only thing worth talking about (and indeed, it's now entirely 
> > moot) is what's the best way to solve this problem /for this kind of 
> > semaphore/.
> 
> it's not really moot in terms of improving the completions code i 
> suspect? For XFS i guess performance matters.

I think the completion code is less optimised than the semaphore code
today.  Clearly the same question does arise, but I don't know what the
answer is for completion users either.

Let's do a quick survey.  drivers/net has 5 users:

3c527.c -- execution_cmd has a mutex held, so never more than one task
waiting anyway.  xceiver_cmd is called during open and close which I think
are serialised at a higher level.  In any case, no performance issue here.

iseries_veth.c -- grabs a spinlock soon after being woken.

plip.c -- called in close, no perf implication.

ppp_synctty.c -- called in close, no perf implication.

ps3_gelic_wireless.c - If this isn't serialised, it's buggy.


Maybe drivers/net is a bad example.  Let's look at */*.c:

as-iosched.c -- in exit path.
blk-barrier.c -- completion on stack, so only one waiter.
blk-exec.c -- ditto
cfq-iosched.c -- in exit path

crypto/api.c -- in init path
crypto/gcm.c -- in setkey path
crypto/tcrypt.c -- crypto testing.  Not a perf path.

fs/exec.c -- waiting for coredumps.
kernel/exit.c -- likewise
kernel/fork.c -- completion on stack
kernel/kmod.c -- completion on stack 
kernel/kthread.c -- kthread creation and deletion.  Shouldn't be a hot
path, plus this looks like there's only going to be one task waiting.
kernel/rcupdate.c -- one completion on stack, one synchronised by a mutex
kernel/rcutorture.c -- doesn't matter
kernel/sched.c -- both completions on stack
kernel/stop_machine.c -- completion on stack
kernel/sysctl.c -- completion on stack
kernel/workqueue.c -- completion on stack

lib/klist.c -- This one seems like it could potentially have lots of
waiters, if only anything actually used klists.

It seems like most users use completions where it'd be just as easy
to use a task pointer and call wake_up_process().  In any case, I think
there's no evidence one way or the other about how people are using
multi-sleeper completions.
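
(For completeness, the "completion on stack" pattern that dominates the
list above is the single-waiter case, where wake-up ordering is a
non-issue -- the helper name here is made up:)

	DECLARE_COMPLETION_ONSTACK(done);

	kick_off_async_work(&done);	/* hypothetical; ends by calling complete(&done) */
	wait_for_completion(&done);	/* exactly one sleeper on this completion */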

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-13 15:28                                                               ` Matthew Wilcox
@ 2008-05-13 17:13                                                                 ` Ingo Molnar
  2008-05-13 17:22                                                                   ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-13 17:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra


* Matthew Wilcox <matthew@wil.cx> wrote:

> > yes, but even for parallel wakeups for completions it's good in 
> > general to keep more tasks in flight than to keep less tasks in 
> > flight.
> 
> That might be the case for some users, but it isn't the case for XFS. 
> The first thing that each task does is grab a spinlock, so if you put 
> as much in flight as early as possible, you end up with horrible 
> contention on that spinlock. [...]

hm, this sounds like damage that is inflicted on itself by the XFS code. 

Why does it signal to its waiters that "resource is available", when in 
reality that resource is not available but immediately serialized via a 
lock? (even if the lock might technically be some _other_ object)

I have not looked closely at this but the more natural wakeup flow here 
would be that if you know there's going to be immediate contention, to 
signal a _single_ resource to a _single_ waiter, and then once that 
contention point is over a (hopefully) much more parallel processing 
phase occurs, to use a multi-value completion there.

in other words: don't tell the scheduler that there is parallelism in the 
system when in reality there is not. And for the same reason, do not 
throttle wakeups in a completion mechanism artificially because one 
given user utilizes it suboptimally. Once throttled it's not possible to 
regain that lost parallelism.

> [...] I have no idea whether this is the common case for multi-valued 
> semaphores or not, it's just the only one I have data for.

yeah. I'd guess XFS would be the primary user in this area who cares 
about performance.

> It seems like most users use completions where it'd be just as easy to 
> use a task pointer and call wake_up_task(). [...]

yeah - although i guess in general it's a bit safer to use an explicit 
completion. With a task pointer you have to be sure the task is still 
present, etc. (with a completion you are forced to put that completion 
object _somewhere_, which immediately forces one to think about lifetime 
issues. A wakeup to a single task pointer is way too easy to get wrong.)

So in general i'd recommend the use of completions.

> [...] In any case, I think there's no evidence one way or the other 
> about how people are using multi-sleeper completions.

yeah, that's definitely so.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-13 17:13                                                                 ` Ingo Molnar
@ 2008-05-13 17:22                                                                   ` Linus Torvalds
  2008-05-13 21:05                                                                     ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-13 17:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matthew Wilcox, Sven Wegener, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra



On Tue, 13 May 2008, Ingo Molnar wrote:
> 
> hm, this sounds like damage that is inflicted on itself by the XFS code. 

No. You're confusing what a counting semaphore is.

> Why does it signal to its waiters that "resource is available", when in 
> reality that resource is not available but immediately serialized via a 
> lock? (even if the lock might technically be some _other_ object)

So you have 'n' resources available, and you use a counting semaphore for 
that resource counting.

But you'd still need a spinlock to actually protect the list of resources 
themselves. The spinlock protects a different thing than the semaphore. 
The semaphore isn't about mutual exclusion - it's about counting resources 
and waiting when there are too many things in flight.

And you seem to think that a counting semaphore is about mutual exclusion. 
It has nothing what-so-ever to do with that. 
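
In code, the pattern is roughly this (a sketch of a generic resource
pool, not the XFS code; the names are made up):

	struct pool {
		struct semaphore slots;		/* sema_init(&slots, n) */
		spinlock_t lock;		/* protects the free list only */
		struct list_head free;		/* of struct pool_item.list */
	};

	struct pool_item *pool_get(struct pool *p)
	{
		struct pool_item *item;

		down(&p->slots);		/* count resources, sleep if none left */
		spin_lock(&p->lock);		/* short list op, not mutual exclusion
						 * over the resource itself */
		item = list_first_entry(&p->free, struct pool_item, list);
		list_del(&item->list);
		spin_unlock(&p->lock);
		return item;
	}

	void pool_put(struct pool *p, struct pool_item *item)
	{
		spin_lock(&p->lock);
		list_add(&item->list, &p->free);
		spin_unlock(&p->lock);
		up(&p->slots);			/* one more resource available */
	}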

			Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [git pull] scheduler fixes
  2008-05-13 17:22                                                                   ` Linus Torvalds
@ 2008-05-13 21:05                                                                     ` Ingo Molnar
  0 siblings, 0 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-13 21:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Sven Wegener, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Why does it signal to its waiters that "resource is available", when 
> > in reality that resource is not available but immediately serialized 
> > via a lock? (even if the lock might technically be some _other_ 
> > object)
> 
> So you have 'n' resources available, and you use a counting semaphore 
> for that resource counting.
> 
> But you'd still need a spinlock to actually protect the list of 
> resources themselves. The spinlock protects a different thing than the 
> semaphore. The semaphore isn't about mutual exclusion - it's about 
> counting resources and waiting when there are too many things in 
> flight.
> 
> And you seem to think that a counting semaphore is about mutual 
> exclusion. It has nothing what-so-ever to do with that.

i was reacting to the point that Matthew made:

  " The first thing that each task does is grab a spinlock, so if you 
    put as much in flight as early as possible, you end up with horrible 
    contention on that spinlock. "

We were talking about the case of double, parallel up()s. My point was 
that the best guess is to put two tasks in flight in the synchronization 
primitive. Matthew's point was that the wakeups of the two tasks should 
be chained: one task gets woken first, which then wakes the second task 
in a chain. [ Matthew, i'm paraphrasing your opinion so please correct 
me if i'm misinterpreting your point. ]

My argument is that chaining like that in the synchronization primitive 
is bad for parallelism in general, because wakeup latencies are just too 
long in general and any action required in the context of the "first 
waker" throttles parallelism artificially and introduces artificial 
workload delays.

Even in the simple list+lock case you are talking about it's beneficial 
to keep as many wakers in flight as possible. The reason is that even 
with the worst-possible cacheline bouncing imaginable, it's much better 
to spin a bit on a spinlock (which is "horrible bouncing" only on paper, 
in practice it's nicely ordered waiting on a FIFO spinlock, with a list 
op every 100 nsecs or so) than to implement "soft bouncing" of tasks via 
an artificial chain of wakeups.

That artificial chain of wakeups has a latency of a few microseconds 
even in the best case - and in the worst-case it can be a delay of 
milliseconds later - throttling not just the current parallelism of the 
workload but also hiding potential future parallelism. The hardware is 
so much faster at messaging. [*] [**]

	Ingo

[*]  Not without qualifications - maybe not so on 1024 CPU systems, but
     certainly so on anything sane up to 16way or so. But if worry about 
     1024 CPU systems i'd suggest to first take a look at the current 
     practice in kernel/semaphore.c taking the semaphore internal 
     spinlock again _after_ a task has woken up, just to remove itself 
     from the list of waiters. That is rather unnecessary serialization 
     - at up() time we are already holding the lock, so we should remove
     the entry there. That's what the mutex code does too.

[**] The only place where i can see staggered/chained wakeups help is in 
     the specific case when the wakee runs into a heavy lock _right 
     after_ wakeup. XFS might be running into that and my reply above 
     talks about that hypothetical scenario.

     If it is so then it is handled incorrectly, because in that case we 
     don't have any true parallelism at the point of up(), and we know it 
     right there. There is no reason to pretend at that point that we 
     have more parallelism, when all we do is we block on a heavy lock 
     right after wakeup.

     Instead, the correct implementation in that case would be to have a 
     wait queue for that heavy lock (which in other words can be thought 
     of as a virtual single-resource which is either 'available' or 
     'unavailable'), and _then_ after that to use a counting semaphore 
     for the rest of the program flow, which is hopefully more parallel. 

     I.e. precisely map the true parallelism of the code via the 
     synchronization primitives, do not over-synchronize it and do not
     under-synchronize it. And if there's any doubt which one should be 
     used, under-synchronize it - because while the scheduler is rather
     good at dealing with too many tasks it cannot conjure runnable 
     tasks out of thin air.

     Btw., the AIM7 regression was exactly that: in the 50% regressed 
     workload the semaphore code hid the true parallelism of the 
     workload and we had only had 5 tasks on the runqueue and the 
     scheduler had no chance to saturate all CPUs. In the "good" case 
     (be that spinlock based or proper-semaphores based BKL) there were 
     2000 runnable tasks on the runqueues, and the scheduler sorted them 
     out and batched the workload just fine!

^ permalink raw reply	[flat|nested] 140+ messages in thread

end of thread, other threads:[~2008-05-13 21:06 UTC | newest]

Thread overview: 140+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-06  5:48 AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
2008-05-06 11:18 ` Matthew Wilcox
2008-05-06 11:44 ` Ingo Molnar
2008-05-06 12:09   ` Matthew Wilcox
2008-05-06 16:23     ` Matthew Wilcox
2008-05-06 16:36       ` Linus Torvalds
2008-05-06 16:42         ` Matthew Wilcox
2008-05-06 16:39           ` Alan Cox
2008-05-06 16:51             ` Matthew Wilcox
2008-05-06 16:45               ` Alan Cox
2008-05-06 17:42               ` Linus Torvalds
2008-05-06 20:28           ` Linus Torvalds
2008-05-06 16:44         ` J. Bruce Fields
2008-05-06 17:21       ` Andrew Morton
2008-05-06 17:31         ` Matthew Wilcox
2008-05-06 17:49           ` Ingo Molnar
2008-05-06 18:07             ` Andrew Morton
2008-05-11 11:11               ` Matthew Wilcox
2008-05-06 17:39         ` Ingo Molnar
2008-05-07  6:49           ` Zhang, Yanmin
2008-05-06 17:45         ` Linus Torvalds
2008-05-07 16:38         ` Matthew Wilcox
2008-05-07 16:55           ` Linus Torvalds
2008-05-07 17:08             ` Linus Torvalds
2008-05-07 17:16               ` Andrew Morton
2008-05-07 17:27                 ` Linus Torvalds
2008-05-07 17:22               ` Ingo Molnar
2008-05-07 17:25                 ` Ingo Molnar
2008-05-07 17:31                 ` Linus Torvalds
2008-05-07 17:47                   ` Linus Torvalds
2008-05-07 17:49                   ` Ingo Molnar
2008-05-07 18:02                     ` Linus Torvalds
2008-05-07 18:17                       ` Ingo Molnar
2008-05-07 18:27                         ` Linus Torvalds
2008-05-07 18:43                           ` Ingo Molnar
2008-05-07 19:01                             ` Linus Torvalds
2008-05-07 19:09                               ` Ingo Molnar
2008-05-07 19:24                               ` Matthew Wilcox
2008-05-07 19:44                                 ` Linus Torvalds
2008-05-07 20:00                                   ` Oi. NFS people. Read this Matthew Wilcox
2008-05-07 22:10                                     ` Trond Myklebust
2008-05-09  1:43                                       ` J. Bruce Fields
2008-05-08  3:24       ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
2008-05-08  3:34         ` Linus Torvalds
2008-05-08  4:37           ` Zhang, Yanmin
2008-05-08 14:58             ` Linus Torvalds
2008-05-07  2:11   ` Zhang, Yanmin
2008-05-07  3:41     ` Zhang, Yanmin
2008-05-07  3:59       ` Andrew Morton
2008-05-07  4:46         ` Zhang, Yanmin
2008-05-07  6:26       ` Ingo Molnar
2008-05-07  6:28         ` Ingo Molnar
2008-05-07  7:05           ` Zhang, Yanmin
2008-05-07 11:00       ` Andi Kleen
2008-05-07 11:46         ` Matthew Wilcox
2008-05-07 12:21           ` Andi Kleen
2008-05-07 14:36             ` Linus Torvalds
2008-05-07 14:35               ` Alan Cox
2008-05-07 15:00                 ` Linus Torvalds
2008-05-07 15:02                   ` Linus Torvalds
2008-05-07 14:57               ` Andi Kleen
2008-05-07 15:31                 ` Andrew Morton
2008-05-07 16:22                   ` Matthew Wilcox
2008-05-07 15:19               ` Linus Torvalds
2008-05-07 17:14                 ` Ingo Molnar
2008-05-08  2:44                 ` Zhang, Yanmin
2008-05-08  3:29                   ` Linus Torvalds
2008-05-08  4:08                     ` Zhang, Yanmin
2008-05-08  4:17                       ` Linus Torvalds
2008-05-08 12:01                         ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Ingo Molnar
2008-05-08 12:28                           ` Ingo Molnar
2008-05-08 14:43                             ` Ingo Molnar
2008-05-08 15:10                               ` [git pull] scheduler fixes Ingo Molnar
2008-05-08 15:33                                 ` Adrian Bunk
2008-05-08 15:41                                   ` Ingo Molnar
2008-05-08 19:42                                     ` Adrian Bunk
2008-05-11 11:03                                 ` Matthew Wilcox
2008-05-11 11:14                                   ` Matthew Wilcox
2008-05-11 11:48                                   ` Matthew Wilcox
2008-05-11 12:50                                     ` Ingo Molnar
2008-05-11 12:52                                       ` Ingo Molnar
2008-05-11 13:02                                         ` Matthew Wilcox
2008-05-11 13:26                                           ` Matthew Wilcox
2008-05-11 14:00                                             ` Ingo Molnar
2008-05-11 14:18                                               ` Matthew Wilcox
2008-05-11 14:42                                                 ` Ingo Molnar
2008-05-11 14:48                                                   ` Matthew Wilcox
2008-05-11 15:19                                                     ` Ingo Molnar
2008-05-11 15:29                                                       ` Matthew Wilcox
2008-05-13 14:11                                                         ` Ingo Molnar
2008-05-13 14:21                                                           ` Matthew Wilcox
2008-05-13 14:42                                                             ` Ingo Molnar
2008-05-13 15:28                                                               ` Matthew Wilcox
2008-05-13 17:13                                                                 ` Ingo Molnar
2008-05-13 17:22                                                                   ` Linus Torvalds
2008-05-13 21:05                                                                     ` Ingo Molnar
2008-05-11 13:54                                           ` Ingo Molnar
2008-05-11 14:22                                             ` Matthew Wilcox
2008-05-11 14:32                                               ` Ingo Molnar
2008-05-11 14:46                                                 ` Matthew Wilcox
2008-05-11 16:47                                                 ` Linus Torvalds
2008-05-11 13:01                                   ` Ingo Molnar
2008-05-11 13:06                                     ` Matthew Wilcox
2008-05-11 13:45                                       ` Ingo Molnar
2008-05-11 14:10                                   ` Sven Wegener
2008-05-08 16:02                             ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Linus Torvalds
2008-05-08 18:30                               ` Linus Torvalds
2008-05-08 20:19                                 ` Ingo Molnar
2008-05-08 20:27                                   ` Linus Torvalds
2008-05-08 21:45                                     ` Ingo Molnar
2008-05-08 22:02                                       ` Ingo Molnar
2008-05-08 22:55                                       ` Linus Torvalds
2008-05-08 23:07                                         ` Linus Torvalds
2008-05-08 23:14                                           ` Linus Torvalds
2008-05-08 23:16                                         ` Alan Cox
2008-05-08 23:33                                           ` Linus Torvalds
2008-05-08 23:27                                             ` Alan Cox
2008-05-09  6:50                                             ` Ingo Molnar
2008-05-09  8:29                                             ` Andi Kleen
2008-05-08 13:20                           ` Matthew Wilcox
2008-05-08 15:01                             ` Ingo Molnar
2008-05-08 13:56                           ` Arjan van de Ven
2008-05-08  6:43                   ` AIM7 40% regression with 2.6.26-rc1 Ingo Molnar
2008-05-08  6:48                     ` Andrew Morton
2008-05-08  7:14                     ` Zhang, Yanmin
2008-05-08  7:39                       ` Ingo Molnar
2008-05-08  8:44                         ` Zhang, Yanmin
2008-05-08  9:21                           ` Ingo Molnar
2008-05-08  9:29                             ` Ingo Molnar
2008-05-08  9:30                             ` Zhang, Yanmin
2008-05-07 16:20               ` Ingo Molnar
2008-05-07 16:35                 ` Linus Torvalds
2008-05-07 17:05                   ` Ingo Molnar
2008-05-07 17:24                     ` Linus Torvalds
2008-05-07 17:36                       ` Ingo Molnar
2008-05-07 17:55                         ` Linus Torvalds
2008-05-07 17:59                           ` Matthew Wilcox
2008-05-07 18:17                             ` Linus Torvalds
2008-05-07 18:49                               ` Ingo Molnar
2008-05-07 13:59         ` Alan Cox
