* AIM7 40% regression with 2.6.26-rc1 @ 2008-05-06 5:48 Zhang, Yanmin 2008-05-06 11:18 ` Matthew Wilcox 2008-05-06 11:44 ` Ingo Molnar 0 siblings, 2 replies; 140+ messages in thread From: Zhang, Yanmin @ 2008-05-06 5:48 UTC (permalink / raw) To: Matthew Wilcox; +Cc: LKML Comparing with kernel 2.6.25, AIM7 (use tmpfs) has a more than 40% regression with 2.6.26-rc1 on my 8-core Stoakley, 16-core Tigerton, and Itanium Montecito. Bisecting located the patch below. 64ac24e738823161693bf791f87adc802cf529ff is first bad commit commit 64ac24e738823161693bf791f87adc802cf529ff Author: Matthew Wilcox <matthew@wil.cx> Date: Fri Mar 7 21:55:58 2008 -0500 Generic semaphore implementation After I manually reverted the patch against 2.6.26-rc1 while fixing lots of conflicts/errors, the AIM7 regression became less than 2%. -yanmin ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 5:48 AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin @ 2008-05-06 11:18 ` Matthew Wilcox 2008-05-06 11:44 ` Ingo Molnar 1 sibling, 0 replies; 140+ messages in thread From: Matthew Wilcox @ 2008-05-06 11:18 UTC (permalink / raw) To: Zhang, Yanmin; +Cc: LKML On Tue, May 06, 2008 at 01:48:24PM +0800, Zhang, Yanmin wrote: > Comparing with kernel 2.6.25, AIM7 (use tmpfs) has a more than 40% regression with 2.6.26-rc1 > on my 8-core Stoakley, 16-core Tigerton, and Itanium Montecito. Bisecting located > the patch below. > > 64ac24e738823161693bf791f87adc802cf529ff is first bad commit > commit 64ac24e738823161693bf791f87adc802cf529ff > Author: Matthew Wilcox <matthew@wil.cx> > Date: Fri Mar 7 21:55:58 2008 -0500 > > Generic semaphore implementation > > > After I manually reverted the patch against 2.6.26-rc1 while fixing lots of > conflicts/errors, the AIM7 regression became less than 2%. 40%?! That's shocking. Can you tell which semaphore was heavily contended? I have a horrible feeling that it's the BKL. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 5:48 AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin 2008-05-06 11:18 ` Matthew Wilcox @ 2008-05-06 11:44 ` Ingo Molnar 2008-05-06 12:09 ` Matthew Wilcox 2008-05-07 2:11 ` Zhang, Yanmin 1 sibling, 2 replies; 140+ messages in thread From: Ingo Molnar @ 2008-05-06 11:44 UTC (permalink / raw) To: Zhang, Yanmin Cc: Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > Comparing with kernel 2.6.25, AIM7 (use tmpfs) has a more than 40% regression with > 2.6.26-rc1 on my 8-core Stoakley, 16-core Tigerton, and Itanium > Montecito. Bisecting located the patch below. > > 64ac24e738823161693bf791f87adc802cf529ff is first bad commit > commit 64ac24e738823161693bf791f87adc802cf529ff > Author: Matthew Wilcox <matthew@wil.cx> > Date: Fri Mar 7 21:55:58 2008 -0500 > > Generic semaphore implementation > > After I manually reverted the patch against 2.6.26-rc1 while fixing > lots of conflicts/errors, the AIM7 regression became less than 2%. hm, which exact semaphore would that be due to? My first blind guess would be the BKL - there's not much other semaphore use left in the core kernel otherwise that would affect AIM7 normally. The VFS still makes frequent use of the BKL and AIM7 is very VFS intense. Getting rid of that BKL use from the VFS might be useful to performance anyway. Could you try to check that it's indeed the BKL? Easiest way to check it would be to run AIM7 on sched-devel.git/latest and do scheduler tracing via: http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt by doing: echo stacktrace > /debug/tracing/iter_ctl you could get exact backtraces of all scheduling points in the trace. If the BKL's down() shows up in those traces then it's definitely the BKL that causes this. The backtraces will also tell us exactly which BKL use is the most frequent one.
To keep tracing overhead low on SMP i'd also suggest to only trace a single CPU, via: echo 1 > /debug/tracing/tracing_cpumask Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 11:44 ` Ingo Molnar @ 2008-05-06 12:09 ` Matthew Wilcox 2008-05-06 16:23 ` Matthew Wilcox 2008-05-07 2:11 ` Zhang, Yanmin 1 sibling, 1 reply; 140+ messages in thread From: Matthew Wilcox @ 2008-05-06 12:09 UTC (permalink / raw) To: Ingo Molnar Cc: Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, Andrew Morton, linux-fsdevel On Tue, May 06, 2008 at 01:44:49PM +0200, Ingo Molnar wrote: > * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > > After I manually reverted the patch against 2.6.26-rc1 while fixing > > lots of conflictions/errors, aim7 regression became less than 2%. > > hm, which exact semaphore would that be due to? > > My first blind guess would be the BKL - there's not much other semaphore > use left in the core kernel otherwise that would affect AIM7 normally. > The VFS still makes frequent use of the BKL and AIM7 is very VFS > intense. Getting rid of that BKL use from the VFS might be useful to > performance anyway. That's slightly slanderous to the VFS ;-) The BKL really isn't used that much any more. So little that I've gone through and produced a list of places it's used: fs/block_dev.c opening and closing a block device. Unlikely to be provoked by AIM7. fs/char_dev.c chrdev_open. Unlikely to be provoked by AIM7. fs/compat.c mount. Unlikely to be provoked by AIM7. fs/compat_ioctl.c held around calls to ioctl translator. fs/exec.c coredump. If this is a contention problem ... fs/fcntl.c held around call to ->fasync. fs/ioctl.c held around f_op ->ioctl call (tmpfs doesn't have ioctl). ditto bmap. there's fasync, as previously mentioned. fs/locks.c hellhole. I hope AIM7 doesn't use locks. fs/namespace.c mount, umount. Unlikely to be provoked by AIM7. fs/read_write.c llseek. tmpfs uses the unlocked version. fs/super.c shutdown, remount. Unlikely to be provoked by AIM7. So the only likely things I can see are: - file locks - fasync -- Intel are signing my paycheques ...
these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 12:09 ` Matthew Wilcox @ 2008-05-06 16:23 ` Matthew Wilcox 2008-05-06 16:36 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 140+ messages in thread From: Matthew Wilcox @ 2008-05-06 16:23 UTC (permalink / raw) To: Ingo Molnar, J. Bruce Fields Cc: Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, Andrew Morton, linux-fsdevel On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote: > So the only likely things I can see are: > > - file locks > - fasync I've wanted to fix file locks for a while. Here's a first attempt. It was done quickly, so I concede that it may well have bugs in it. I found (and fixed) one with LTP. It takes *no account* of nfsd, nor remote filesystems. We need to have a serious discussion about their requirements. diff --git a/fs/locks.c b/fs/locks.c index 663c069..cb09765 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -140,6 +140,8 @@ int lease_break_time = 45; #define for_each_lock(inode, lockp) \ for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next) +static DEFINE_SPINLOCK(file_lock_lock); + static LIST_HEAD(file_lock_list); static LIST_HEAD(blocked_list); @@ -510,9 +512,9 @@ static void __locks_delete_block(struct file_lock *waiter) */ static void locks_delete_block(struct file_lock *waiter) { - lock_kernel(); + spin_lock(&file_lock_lock); __locks_delete_block(waiter); - unlock_kernel(); + spin_unlock(&file_lock_lock); } /* Insert waiter into blocker's block list. 
@@ -649,7 +651,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl) { struct file_lock *cfl; - lock_kernel(); + spin_lock(&file_lock_lock); for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) { if (!IS_POSIX(cfl)) continue; @@ -662,7 +664,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl) fl->fl_pid = pid_vnr(cfl->fl_nspid); } else fl->fl_type = F_UNLCK; - unlock_kernel(); + spin_unlock(&file_lock_lock); return; } EXPORT_SYMBOL(posix_test_lock); @@ -735,18 +737,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) int error = 0; int found = 0; - lock_kernel(); - if (request->fl_flags & FL_ACCESS) + if (request->fl_flags & FL_ACCESS) { + spin_lock(&file_lock_lock); goto find_conflict; + } if (request->fl_type != F_UNLCK) { error = -ENOMEM; + new_fl = locks_alloc_lock(); if (new_fl == NULL) - goto out; + goto out_unlocked; error = 0; } + spin_lock(&file_lock_lock); for_each_lock(inode, before) { struct file_lock *fl = *before; if (IS_POSIX(fl)) @@ -772,10 +777,13 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) * If a higher-priority process was blocked on the old file lock, * give it the opportunity to lock the file. 
*/ - if (found) + if (found) { + spin_unlock(&file_lock_lock); cond_resched(); + spin_lock(&file_lock_lock); + } -find_conflict: + find_conflict: for_each_lock(inode, before) { struct file_lock *fl = *before; if (IS_POSIX(fl)) @@ -796,8 +804,9 @@ find_conflict: new_fl = NULL; error = 0; -out: - unlock_kernel(); + out: + spin_unlock(&file_lock_lock); + out_unlocked: if (new_fl) locks_free_lock(new_fl); return error; @@ -826,7 +835,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str new_fl2 = locks_alloc_lock(); } - lock_kernel(); + spin_lock(&file_lock_lock); if (request->fl_type != F_UNLCK) { for_each_lock(inode, before) { fl = *before; @@ -994,7 +1003,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str locks_wake_up_blocks(left); } out: - unlock_kernel(); + spin_unlock(&file_lock_lock); /* * Free any unused locks. */ @@ -1069,14 +1078,14 @@ int locks_mandatory_locked(struct inode *inode) /* * Search the lock list for this inode for any POSIX locks. */ - lock_kernel(); + spin_lock(&file_lock_lock); for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) { if (!IS_POSIX(fl)) continue; if (fl->fl_owner != owner) break; } - unlock_kernel(); + spin_unlock(&file_lock_lock); return fl ? -EAGAIN : 0; } @@ -1190,7 +1199,7 @@ int __break_lease(struct inode *inode, unsigned int mode) new_fl = lease_alloc(NULL, mode & FMODE_WRITE ? 
F_WRLCK : F_RDLCK); - lock_kernel(); + spin_lock(&file_lock_lock); time_out_leases(inode); @@ -1251,8 +1260,10 @@ restart: break_time++; } locks_insert_block(flock, new_fl); + spin_unlock(&file_lock_lock); error = wait_event_interruptible_timeout(new_fl->fl_wait, !new_fl->fl_next, break_time); + spin_lock(&file_lock_lock); __locks_delete_block(new_fl); if (error >= 0) { if (error == 0) @@ -1266,8 +1277,8 @@ restart: error = 0; } -out: - unlock_kernel(); + out: + spin_unlock(&file_lock_lock); if (!IS_ERR(new_fl)) locks_free_lock(new_fl); return error; @@ -1323,7 +1334,7 @@ int fcntl_getlease(struct file *filp) struct file_lock *fl; int type = F_UNLCK; - lock_kernel(); + spin_lock(&file_lock_lock); time_out_leases(filp->f_path.dentry->d_inode); for (fl = filp->f_path.dentry->d_inode->i_flock; fl && IS_LEASE(fl); fl = fl->fl_next) { @@ -1332,7 +1343,7 @@ int fcntl_getlease(struct file *filp) break; } } - unlock_kernel(); + spin_unlock(&file_lock_lock); return type; } @@ -1363,6 +1374,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp) if (error) return error; + spin_lock(&file_lock_lock); time_out_leases(inode); BUG_ON(!(*flp)->fl_lmops->fl_break); @@ -1370,10 +1382,11 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp) lease = *flp; if (arg != F_UNLCK) { + spin_unlock(&file_lock_lock); error = -ENOMEM; new_fl = locks_alloc_lock(); if (new_fl == NULL) - goto out; + goto out_unlocked; error = -EAGAIN; if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0)) @@ -1382,6 +1395,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp) && ((atomic_read(&dentry->d_count) > 1) || (atomic_read(&inode->i_count) > 1))) goto out; + spin_lock(&file_lock_lock); } /* @@ -1429,11 +1443,14 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp) locks_copy_lock(new_fl, lease); locks_insert_lock(before, new_fl); + spin_unlock(&file_lock_lock); *flp = new_fl; return 0; -out: + out: + 
spin_unlock(&file_lock_lock); + out_unlocked: if (new_fl != NULL) locks_free_lock(new_fl); return error; @@ -1471,12 +1488,10 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease) { int error; - lock_kernel(); if (filp->f_op && filp->f_op->setlease) error = filp->f_op->setlease(filp, arg, lease); else error = generic_setlease(filp, arg, lease); - unlock_kernel(); return error; } @@ -1503,12 +1518,11 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg) if (error) return error; - lock_kernel(); - error = vfs_setlease(filp, arg, &flp); if (error || arg == F_UNLCK) - goto out_unlock; + return error; + lock_kernel(); error = fasync_helper(fd, filp, 1, &flp->fl_fasync); if (error < 0) { /* remove lease just inserted by setlease */ @@ -1519,7 +1533,7 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg) } error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0); -out_unlock: + out_unlock: unlock_kernel(); return error; } @@ -2024,7 +2038,7 @@ void locks_remove_flock(struct file *filp) fl.fl_ops->fl_release_private(&fl); } - lock_kernel(); + spin_lock(&file_lock_lock); before = &inode->i_flock; while ((fl = *before) != NULL) { @@ -2042,7 +2056,7 @@ void locks_remove_flock(struct file *filp) } before = &fl->fl_next; } - unlock_kernel(); + spin_unlock(&file_lock_lock); } /** @@ -2057,12 +2071,12 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter) { int status = 0; - lock_kernel(); + spin_lock(&file_lock_lock); if (waiter->fl_next) __locks_delete_block(waiter); else status = -ENOENT; - unlock_kernel(); + spin_unlock(&file_lock_lock); return status; } @@ -2175,7 +2189,7 @@ static int locks_show(struct seq_file *f, void *v) static void *locks_start(struct seq_file *f, loff_t *pos) { - lock_kernel(); + spin_lock(&file_lock_lock); f->private = (void *)1; return seq_list_start(&file_lock_list, *pos); } @@ -2187,7 +2201,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos) static void 
locks_stop(struct seq_file *f, void *v) { - unlock_kernel(); + spin_unlock(&file_lock_lock); } struct seq_operations locks_seq_operations = { @@ -2215,7 +2229,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len) { struct file_lock *fl; int result = 1; - lock_kernel(); + spin_lock(&file_lock_lock); for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) { if (IS_POSIX(fl)) { if (fl->fl_type == F_RDLCK) @@ -2232,7 +2246,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len) result = 0; break; } - unlock_kernel(); + spin_unlock(&file_lock_lock); return result; } @@ -2255,7 +2269,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len) { struct file_lock *fl; int result = 1; - lock_kernel(); + spin_lock(&file_lock_lock); for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) { if (IS_POSIX(fl)) { if ((fl->fl_end < start) || (fl->fl_start > (start + len))) @@ -2270,7 +2284,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len) result = 0; break; } - unlock_kernel(); + spin_unlock(&file_lock_lock); return result; } -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply related [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 16:23 ` Matthew Wilcox @ 2008-05-06 16:36 ` Linus Torvalds 2008-05-06 16:42 ` Matthew Wilcox 2008-05-06 16:44 ` J. Bruce Fields 2008-05-06 17:21 ` Andrew Morton 2008-05-08 3:24 ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin 2 siblings, 2 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-06 16:36 UTC (permalink / raw) To: Matthew Wilcox Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-fsdevel On Tue, 6 May 2008, Matthew Wilcox wrote: > > I've wanted to fix file locks for a while. Here's a first attempt. > It was done quickly, so I concede that it may well have bugs in it. > I found (and fixed) one with LTP. Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is there some user that doesn't have the inode info, or does anything that might cross inode boundaries? This does seem to drop all locking around the "setlease()" calls down to the filesystem, which worries me. That said, we clearly do need to do this. Probably should have done it a long time ago. Also, why do people do this: > -find_conflict: > + find_conflict: Hmm? Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 16:36 ` Linus Torvalds @ 2008-05-06 16:42 ` Matthew Wilcox 2008-05-06 16:39 ` Alan Cox 2008-05-06 20:28 ` Linus Torvalds 2008-05-06 16:44 ` J. Bruce Fields 1 sibling, 2 replies; 140+ messages in thread From: Matthew Wilcox @ 2008-05-06 16:42 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-fsdevel On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote: > On Tue, 6 May 2008, Matthew Wilcox wrote: > > I've wanted to fix file locks for a while. Here's a first attempt. > > It was done quickly, so I concede that it may well have bugs in it. > > I found (and fixed) one with LTP. > > Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is > there some user that doesn't have the inode info, or does anything that > might cross inode boundaries? /proc/locks and deadlock detection both cross inode boundaries (and even filesystem boundaries). The BKL-removal brigade tried this back in 2.4 and the locking ended up scaling worse than just plonking a single spinlock around the whole thing. > This does seem to drop all locking around the "setlease()" calls down to > the filesystem, which worries me. That said, we clearly do need to do > this. Probably should have done it a long time ago. The only filesystems that are going to have their own setlease methods will be remote ones (nfs, smbfs, etc). They're going to need to sleep while the server responds to them. So holding a spinlock while we call them is impolite at best. > Also, why do people do this: > > > -find_conflict: > > + find_conflict: > > Hmm? So that find_conflict doesn't end up in the first column, which causes diff to treat it as a function name for the purposes of the @@ lines. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. 
We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 16:42 ` Matthew Wilcox @ 2008-05-06 16:39 ` Alan Cox 2008-05-06 16:51 ` Matthew Wilcox 2008-05-06 20:28 ` Linus Torvalds 1 sibling, 1 reply; 140+ messages in thread From: Alan Cox @ 2008-05-06 16:39 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-fsdevel > > Hmm? > > So that find_conflict doesn't end up in the first column, which causes > diff to treat it as a function name for the purposes of the @@ lines. Please can we just fix the tools not mangle the kernel to work around silly bugs ? ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 16:39 ` Alan Cox @ 2008-05-06 16:51 ` Matthew Wilcox 2008-05-06 16:45 ` Alan Cox 2008-05-06 17:42 ` Linus Torvalds 0 siblings, 2 replies; 140+ messages in thread From: Matthew Wilcox @ 2008-05-06 16:51 UTC (permalink / raw) To: Alan Cox Cc: Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-fsdevel On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote: > > > Hmm? > > > > So that find_conflict doesn't end up in the first column, which causes > > diff to treat it as a function name for the purposes of the @@ lines. > > Please can we just fix the tools not mangle the kernel to work around > silly bugs ? The people who control the tools refuse to fix them. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 16:51 ` Matthew Wilcox @ 2008-05-06 16:45 ` Alan Cox 2008-05-06 17:42 ` Linus Torvalds 1 sibling, 0 replies; 140+ messages in thread From: Alan Cox @ 2008-05-06 16:45 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-fsdevel On Tue, 6 May 2008 10:51:12 -0600 Matthew Wilcox <matthew@wil.cx> wrote: > On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote: > > > > Hmm? > > > > > > So that find_conflict doesn't end up in the first column, which causes > > > diff to treat it as a function name for the purposes of the @@ lines. > > > > Please can we just fix the tools not mangle the kernel to work around > > silly bugs ? > > The people who control the tools refuse to fix them. That would be their problem. We should refuse to mash the kernel up because they aren't doing their job. If need be someone can fork a private git-patch. ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 16:51 ` Matthew Wilcox 2008-05-06 16:45 ` Alan Cox @ 2008-05-06 17:42 ` Linus Torvalds 1 sibling, 0 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-06 17:42 UTC (permalink / raw) To: Matthew Wilcox Cc: Alan Cox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-fsdevel On Tue, 6 May 2008, Matthew Wilcox wrote: > On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote: > > > > Hmm? > > > > > > So that find_conflict doesn't end up in the first column, which causes > > > diff to treat it as a function name for the purposes of the @@ lines. > > > > Please can we just fix the tools not mangle the kernel to work around > > silly bugs ? > > The people who control the tools refuse to fix them. That's just plain bollocks. First off, that "@@" line thing in diffs is not important enough to screw up the source code for. Ever. It's just a small hint to make it somewhat easier to see more context for humans. Second, it's not even that bad to show the last label there, rather than the function name. Third, you seem to be a git user, so if you actually really care that much about the @@ line, then git actually lets you set your very own pattern for those things. In fact, you can even do it on a per-file basis based on things like filename rules (ie you can have different patterns for what to trigger on for a *.c file and for a *.S file, since in a *.S file the 'name:' thing _is_ the right pattern). So not only are you making idiotic changes just for irrelevant tool usage, you're also apparently lying about people "refusing to fix" things as an excuse. You can play with it. It's documented in gitattributes (see "Defining a custom hunk header"), and the default one is just the same one that GNU diff uses for "-p". I think. 
You can add something like this to your ~/.gitconfig: [diff "default"] funcname=^[a-zA-Z_$].*(.*$ to only trigger the funcname pattern on a line that starts with a valid C identifier thing, and contains a '('. And you can just override the default like the above (that way you don't have to specify attributes), but if you want to do things differently for *.c files than from *.S files, you can edit your .git/info/attributes file and make it contain something like *.S diff=assembler *.c diff=C and now you can make your ~/.gitconfig actually show them differently, ie something like [diff "C"] funcname=^[a-zA-Z_$].*(.*$ [diff "assembler"] funcname=^[a-zA-Z_$].*: etc. Of course, there is a real cost to this, but it's cheap enough in practice that you'll never notice. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 16:42 ` Matthew Wilcox 2008-05-06 16:39 ` Alan Cox @ 2008-05-06 20:28 ` Linus Torvalds 1 sibling, 0 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-06 20:28 UTC (permalink / raw) To: Matthew Wilcox Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-fsdevel On Tue, 6 May 2008, Matthew Wilcox wrote: > On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote: > > > > Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is > > there some user that doesn't have the inode info, or does anything that > > might cross inode boundaries? > > /proc/locks and deadlock detection both cross inode boundaries (and even > filesystem boundaries). The BKL-removal brigade tried this back in 2.4 > and the locking ended up scaling worse than just plonking a single > spinlock around the whole thing. Ok, no worries. Just as long as I know why it's a single lock. Looks ok to me, apart from the need for testing (and talking to NFS etc people). Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 16:36 ` Linus Torvalds 2008-05-06 16:42 ` Matthew Wilcox @ 2008-05-06 16:44 ` J. Bruce Fields 1 sibling, 0 replies; 140+ messages in thread From: J. Bruce Fields @ 2008-05-06 16:44 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Ingo Molnar, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-fsdevel, richterd On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote: > > > On Tue, 6 May 2008, Matthew Wilcox wrote: > > > > I've wanted to fix file locks for a while. Here's a first attempt. > > It was done quickly, so I concede that it may well have bugs in it. > > I found (and fixed) one with LTP. > > Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is > there some user that doesn't have the inode info, or does anything that > might cross inode boundaries? The deadlock detection crosses inode boundaries. --b. ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 16:23 ` Matthew Wilcox 2008-05-06 16:36 ` Linus Torvalds @ 2008-05-06 17:21 ` Andrew Morton 2008-05-06 17:31 ` Matthew Wilcox ` (3 more replies) 2008-05-08 3:24 ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin 2 siblings, 4 replies; 140+ messages in thread From: Andrew Morton @ 2008-05-06 17:21 UTC (permalink / raw) To: Matthew Wilcox Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, linux-fsdevel On Tue, 6 May 2008 10:23:32 -0600 Matthew Wilcox <matthew@wil.cx> wrote: > On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote: > > So the only likely things I can see are: > > > > - file locks > > - fasync > > I've wanted to fix file locks for a while. Here's a first attempt. Do we actually know that the locks code is implicated in this regression? I'd initially thought "lseek" but afaict tmpfs doesn't hit default_llseek() or remote_llseek(). tmpfs tends to do weird stuff - it would be interesting to know if the regression is also present on ramfs or ext2/ext3/xfs/etc. It would be interesting to see if the context switch rate has increased. Finally: how come we regressed by swapping the semaphore implementation anyway? We went from one sleeping lock implementation to another - I'd have expected performance to be pretty much the same. <looks at the implementation> down(), down_interruptible() and down_trylock() should use spin_lock_irq(), not irqsave. up() seems to be doing wake-one, FIFO which is nice. Did the implementation which we just removed also do that? Was it perhaps accidentally doing LIFO or something like that? ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 17:21 ` Andrew Morton @ 2008-05-06 17:31 ` Matthew Wilcox 2008-05-06 17:49 ` Ingo Molnar 2008-05-06 17:39 ` Ingo Molnar ` (2 subsequent siblings) 3 siblings, 1 reply; 140+ messages in thread From: Matthew Wilcox @ 2008-05-06 17:31 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, linux-fsdevel On Tue, May 06, 2008 at 10:21:53AM -0700, Andrew Morton wrote: > Do we actually know that the locks code is implicated in this regression? Not yet. We don't even know it's the BKL. It's just my best guess. We're waiting for the original reporter to run some tests Ingo pointed him at. > I'd initially thought "lseek" but afaict tmpfs doesn't hit default_llseek() > or remote_llseek(). Correct. > Finally: how come we regressed by swapping the semaphore implementation > anyway? We went from one sleeping lock implementation to another - I'd > have expected performance to be pretty much the same. > > <looks at the implementation> > > down(), down_interruptible() and down_trylock() should use spin_lock_irq(), not > irqsave. We talked about this ... the BKL actually requires that you be able to acquire it with interrupts disabled. Maybe we should make lock_kernel do this: if (likely(!depth)) { unsigned long flags; local_save_flags(flags); down(&kernel_sem); local_irq_restore(flags); } But tweaking down() is not worth it -- we should be eliminating users of both the BKL and semaphores instead. > up() seems to be doing wake-one, FIFO which is nice. Did the > implementation which we just removed also do that? Was it perhaps > accidentally doing LIFO or something like that? That's a question for someone who knows x86 assembler, I think. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." 
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 17:31 ` Matthew Wilcox @ 2008-05-06 17:49 ` Ingo Molnar 2008-05-06 18:07 ` Andrew Morton 0 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-06 17:49 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, linux-fsdevel * Matthew Wilcox <matthew@wil.cx> wrote: > > down(), down_interruptible() and down_try() should use > > spin_lock_irq(), not irqsave. > > We talked about this ... the BKL actually requires that you be able to > acquire it with interrupts disabled. [...] hm, where does it require it, besides the early bootup code? (which should just be fixed) down_trylock() is OK as irqsave/irqrestore for legacy reasons, but that is fundamentally atomic anyway. > > up() seems to be doing wake-one, FIFO which is nice. Did the > > implementation which we just removed also do that? Was it perhaps > > accidentally doing LIFO or something like that? > > That's a question for someone who knows x86 assembler, I think. the assembly is mostly just for the fastpath - and a 40% regression cannot be about fastpath differences. In the old code the scheduling happens in lib/semaphore-sleeper.c, and from the looks of it it appears to be a proper FIFO as well. (plus this small wakeup weirdness it has) i reviewed the new code in kernel/semaphore.c as well and can see nothing bad in it - it does proper wake-up, FIFO queueing, like the mutex code. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 17:49 ` Ingo Molnar @ 2008-05-06 18:07 ` Andrew Morton 2008-05-11 11:11 ` Matthew Wilcox 0 siblings, 1 reply; 140+ messages in thread From: Andrew Morton @ 2008-05-06 18:07 UTC (permalink / raw) To: Ingo Molnar Cc: matthew, bfields, yanmin_zhang, linux-kernel, viro, torvalds, linux-fsdevel On Tue, 6 May 2008 19:49:54 +0200 Ingo Molnar <mingo@elte.hu> wrote: > > * Matthew Wilcox <matthew@wil.cx> wrote: > > > > down(), down_interruptible() and down_try() should use > > > spin_lock_irq(), not irqsave. > > > > We talked about this ... the BKL actually requires that you be able to > > acquire it with interrupts disabled. [...] > > hm, where does it require it, besides the early bootup code? (which > should just be fixed) Yeah, the early bootup code. The kernel does accidental lock_kernel()s in various places and if that re-enables interrupts then powerpc goeth crunch. Matthew, that seemingly-unneeded irqsave in lib/semaphore.c is a prime site for /* one of these things */, no? > down_trylock() is OK as irqsave/irqrestore for legacy reasons, but that > is fundamentally atomic anyway. yes, trylock should be made irq-safe. > > > up() seems to be doing wake-one, FIFO which is nice. Did the > > > implementation which we just removed also do that? Was it perhaps > > > accidentally doing LIFO or something like that? > > > > That's a question for someone who knows x86 assembler, I think. > > the assembly is mostly just for the fastpath - and a 40% regression > cannot be about fastpath differences. In the old code the scheduling > happens in lib/semaphore-sleepers.c, and from the looks of it it appears > to be a proper FIFO as well. (plus this small wakeup weirdness it has) > > i reviewed the new code in kernel/semaphore.c as well and can see > nothing bad in it - it does proper wake-up, FIFO queueing, like the > mutex code. > There's the weird wakeup in down() which I understood for about five minutes five years ago.
Perhaps that accidentally sped something up. Oh well, more investigation needed.. ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 18:07 ` Andrew Morton @ 2008-05-11 11:11 ` Matthew Wilcox 0 siblings, 0 replies; 140+ messages in thread From: Matthew Wilcox @ 2008-05-11 11:11 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, bfields, yanmin_zhang, linux-kernel, viro, torvalds, linux-fsdevel On Tue, May 06, 2008 at 11:07:52AM -0700, Andrew Morton wrote: > Yeah, the early bootup code. The kernel does accidental lock_kernel()s in > various places and if that renables interrupts then powerpc goeth crunch. > > Matthew, that seemingly-unneeded irqsave in lib/semaphore.c is a prime site > for /* one of these things */, no? I was just reviewing the code and I came across one of these: /* * Some notes on the implementation: * * The spinlock controls access to the other members of the semaphore. * down_trylock() and up() can be called from interrupt context, so we * have to disable interrupts when taking the lock. It turns out various * parts of the kernel expect to be able to use down() on a semaphore in * interrupt context when they know it will succeed, so we have to use * irqsave variants for down(), down_interruptible() and down_killable() * too. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 17:21 ` Andrew Morton 2008-05-06 17:31 ` Matthew Wilcox @ 2008-05-06 17:39 ` Ingo Molnar 2008-05-07 6:49 ` Zhang, Yanmin 2008-05-06 17:45 ` Linus Torvalds 2008-05-07 16:38 ` Matthew Wilcox 3 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-06 17:39 UTC (permalink / raw) To: Andrew Morton Cc: Matthew Wilcox, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, linux-fsdevel * Andrew Morton <akpm@linux-foundation.org> wrote: > Finally: how come we regressed by swapping the semaphore > implementation anyway? We went from one sleeping lock implementation > to another - I'd have expected performance to be pretty much the same. > > <looks at the implementation> > > down(), down_interruptible() and down_try() should use > spin_lock_irq(), not irqsave. > > up() seems to be doing wake-one, FIFO which is nice. Did the > implementation which we just removed also do that? Was it perhaps > accidentally doing LIFO or something like that? i just checked the old implementation on x86. It used lib/semaphore-sleepers.c which does one weird thing: - __down() when it returns wakes up yet another task via wake_up_locked(). i.e. we'll always keep yet another task in flight. This can mask wakeup latencies especially when it takes time. The patch (hack) below tries to emulate this weirdness - it 'kicks' another task as well and keeps it busy. Most of the time this just causes extra scheduling, but if AIM7 is _just_ saturating the number of CPUs, it might make a difference. Yanmin, does the patch below make any difference to the AIM7 results? ( it would be useful data to get a meaningful context switch trace from the whole regressed workload, and compare it to a context switch trace with the revert added. 
) Ingo --- kernel/semaphore.c | 10 ++++++++++ 1 file changed, 10 insertions(+) Index: linux/kernel/semaphore.c =================================================================== --- linux.orig/kernel/semaphore.c +++ linux/kernel/semaphore.c @@ -261,4 +261,14 @@ static noinline void __sched __up(struct list_del(&waiter->list); waiter->up = 1; wake_up_process(waiter->task); + + if (likely(list_empty(&sem->wait_list))) + return; + /* + * Opportunistically wake up another task as well but do not + * remove it from the list: + */ + waiter = list_first_entry(&sem->wait_list, + struct semaphore_waiter, list); + wake_up_process(waiter->task); } ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 17:39 ` Ingo Molnar @ 2008-05-07 6:49 ` Zhang, Yanmin 0 siblings, 0 replies; 140+ messages in thread From: Zhang, Yanmin @ 2008-05-07 6:49 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Matthew Wilcox, J. Bruce Fields, LKML, Alexander Viro, Linus Torvalds, linux-fsdevel On Tue, 2008-05-06 at 19:39 +0200, Ingo Molnar wrote: > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > Finally: how come we regressed by swapping the semaphore > > implementation anyway? We went from one sleeping lock implementation > > to another - I'd have expected performance to be pretty much the same. > i.e. we'll always keep yet another task in flight. This can mask wakeup > latencies especially when it takes time. > > The patch (hack) below tries to emulate this weirdness - it 'kicks' > another task as well and keeps it busy. Most of the time this just > causes extra scheduling, but if AIM7 is _just_ saturating the number of > CPUs, it might make a difference. Yanmin, does the patch below make any > difference to the AIM7 results? I tested it on my 8-core stoakley and the result is 12% worse than the one of pure 2.6.26-rc1. -yanmin > > ( it would be useful data to get a meaningful context switch trace from > the whole regressed workload, and compare it to a context switch trace > with the revert added. 
) > > Ingo > > --- > kernel/semaphore.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > Index: linux/kernel/semaphore.c > =================================================================== > --- linux.orig/kernel/semaphore.c > +++ linux/kernel/semaphore.c > @@ -261,4 +261,14 @@ static noinline void __sched __up(struct > list_del(&waiter->list); > waiter->up = 1; > wake_up_process(waiter->task); > + > + if (likely(list_empty(&sem->wait_list))) > + return; > + /* > + * Opportunistically wake up another task as well but do not > + * remove it from the list: > + */ > + waiter = list_first_entry(&sem->wait_list, > + struct semaphore_waiter, list); > + wake_up_process(waiter->task); > } ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 17:21 ` Andrew Morton 2008-05-06 17:31 ` Matthew Wilcox 2008-05-06 17:39 ` Ingo Molnar @ 2008-05-06 17:45 ` Linus Torvalds 2008-05-07 16:38 ` Matthew Wilcox 3 siblings, 0 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-06 17:45 UTC (permalink / raw) To: Andrew Morton Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Tue, 6 May 2008, Andrew Morton wrote: > > down(), down_interruptible() and down_try() should use spin_lock_irq(), not > irqsave. down_trylock() is used in atomic code. See for example kernel/printk.c. So no, that one needs to be irqsafe. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 17:21 ` Andrew Morton ` (2 preceding siblings ...) 2008-05-06 17:45 ` Linus Torvalds @ 2008-05-07 16:38 ` Matthew Wilcox 2008-05-07 16:55 ` Linus Torvalds 3 siblings, 1 reply; 140+ messages in thread From: Matthew Wilcox @ 2008-05-07 16:38 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, linux-fsdevel On Tue, May 06, 2008 at 10:21:53AM -0700, Andrew Morton wrote: > up() seems to be doing wake-one, FIFO which is nice. Did the > implementation which we just removed also do that? Was it perhaps > accidentally doing LIFO or something like that? If heavily contended, it could do this. up() would increment sem->count and call __up(), which would call wake_up(). down() would decrement sem->count. The unlucky task woken by __up() would lose the race and go back to sleep. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 16:38 ` Matthew Wilcox @ 2008-05-07 16:55 ` Linus Torvalds 2008-05-07 17:08 ` Linus Torvalds 0 siblings, 1 reply; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 16:55 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrew Morton, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, 7 May 2008, Matthew Wilcox wrote: > > If heavily contended, it could do this. It doesn't have to be heavily contended - if it's just hot and a bit lucky, it would potentially never schedule at all, because it would never take the spinlock and serialize the callers. It doesn't even need "unfairness" to work that way. The old semaphore implementation was very much designed to be lock-free, and if you had one CPU doing a lock while another did an unlock, the *common* situation was that the unlock would succeed first, because the unlocker was also the person who had the spinlock exclusively in its cache! The above may count as "lucky", but the hot-cache-line thing is a big deal. It likely turns "lucky" into something that isn't a 50:50 chance, but something that is quite possible to trigger consistently if you just have mostly short holders of the lock. Which, btw, is probably true. The BKL is normally held for short times, and released (by that thread) for relatively much longer times. Which is when spinlocks tend to work the best, even when they are fair (because it's not so much a fairness issue, it's simply a cost-of-taking-the-lock issue!) Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 16:55 ` Linus Torvalds @ 2008-05-07 17:08 ` Linus Torvalds 2008-05-07 17:16 ` Andrew Morton 2008-05-07 17:22 ` Ingo Molnar 0 siblings, 2 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 17:08 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrew Morton, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, 7 May 2008, Linus Torvalds wrote: > > Which, btw, is probably true. The BKL is normally held for short times, > and released (by that thread) for relatively much longer times. Which > is when spinlocks tend to work the best, even when they are fair (because > it's not so much a fairness issue, it's simply a cost-of-taking-the-lock > issue!) .. and don't get me wrong: the old semaphores (and the new mutexes) should also have this property when lucky: taking the lock is often a hot-path case. And the spinlock+generic semaphore thing probably makes that "lucky" behavior be exponentially less likely, because now to hit the lucky case, rather than the hot path having just *one* access to the interesting cache line, it has basically something like 4 accesses (spinlock, count test, count decrement, spinunlock), in addition to various serializing instructions, so I suspect it quite often gets serialized simply because even the "fast path" is actually about ten times as long! As a result, a slow "fast path" means that the thing gets saturated much more easily, and that in turn means that the "fast path" turns into a "slow path" more easily, which is how you end up in the scheduler rather than just taking the fast path. This is why sleeping locks are more expensive in general: they have a *huge* cost from when they get contended. Hundreds of times higher than a spinlock. And the faster they are, the longer it takes for them to get contended under load. So slowing them down in the fast path is a double whammy, in that it shows their bad behaviour much earlier. 
And the generic semaphores really are slower than the old optimized ones in that fast path. By a *big* amount. Which is why I'm 100% convinced it's not even worth saving the old code. It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with "slow path" other than the fact that it gets to that slow path much more these days. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 17:08 ` Linus Torvalds @ 2008-05-07 17:16 ` Andrew Morton 2008-05-07 17:27 ` Linus Torvalds 2008-05-07 17:22 ` Ingo Molnar 1 sibling, 1 reply; 140+ messages in thread From: Andrew Morton @ 2008-05-07 17:16 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, 7 May 2008 10:08:18 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > Which is why I'm 100% convinced it's not even worth saving the old code. > It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with > "slow path" other than the fact that it gets to that slow path much more > these days. Stupid question: why doesn't lock_kernel() use a mutex? (stupid answer: it'll trigger might_sleep() checks when we do it early in boot with irqs disabled, but we can fix that) (And __might_sleep()'s system_state check might even save us from that) Of course, we shouldn't change anything until we've worked out why the new semaphores got slower. ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 17:16 ` Andrew Morton @ 2008-05-07 17:27 ` Linus Torvalds 0 siblings, 0 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 17:27 UTC (permalink / raw) To: Andrew Morton Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, 7 May 2008, Andrew Morton wrote: > On Wed, 7 May 2008 10:08:18 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > Which is why I'm 100% convinced it's not even worth saving the old code. > > It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with > > "slow path" other than the fact that it gets to that slow path much more > > these days. > > Stupid question: why doesn't lock_kernel() use a mutex? Not stupid. The only reason some code didn't get turned over to mutexes was literally that they didn't want the debugging because they were doing intentionally bad things. I think the BKL is one of them (the console semaphore was another, iirc). Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 17:08 ` Linus Torvalds 2008-05-07 17:16 ` Andrew Morton @ 2008-05-07 17:22 ` Ingo Molnar 2008-05-07 17:25 ` Ingo Molnar 2008-05-07 17:31 ` Linus Torvalds 1 sibling, 2 replies; 140+ messages in thread From: Ingo Molnar @ 2008-05-07 17:22 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > Which is why I'm 100% convinced it's not even worth saving the old > code. It needs to use mutexes, or spinlocks. I bet it has *nothing* to > do with "slow path" other than the fact that it gets to that slow path > much more these days. i think your theory should be easy to test: Yanmin, could you turn on CONFIG_MUTEX_DEBUG=y and check by how much AIM7 regresses? Because in the CONFIG_MUTEX_DEBUG=y case the mutex debug code does exactly that: it doesn't use the single-instruction fastpath [it uses asm-generic/mutex-null.h] but always drops into the slowpath (to be able to access debug state). That debug code is about as expensive as the generic semaphore code's current fastpath. (perhaps even more expensive.) There's far more normal mutex fastpath use during an AIM7 run than any BKL use. So if it's due to any direct fastpath overhead and the resulting widening of the window for the real slowdown, we should see a severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed? Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 17:22 ` Ingo Molnar @ 2008-05-07 17:25 ` Ingo Molnar 2008-05-07 17:31 ` Linus Torvalds 1 sibling, 0 replies; 140+ messages in thread From: Ingo Molnar @ 2008-05-07 17:25 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel * Ingo Molnar <mingo@elte.hu> wrote: > There's far more normal mutex fastpath use during an AIM7 run than any > BKL use. So if it's due to any direct fastpath overhead and the > resulting widening of the window for the real slowdown, we should see > a severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed? my own guesstimate about AIM7 performance impact resulting out of CONFIG_MUTEX_DEBUG=y: performance overhead will not be measurable, or will at most be in the sub-1% range. But i've been badly wrong before :) Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 17:22 ` Ingo Molnar 2008-05-07 17:25 ` Ingo Molnar @ 2008-05-07 17:31 ` Linus Torvalds 2008-05-07 17:47 ` Linus Torvalds 2008-05-07 17:49 ` Ingo Molnar 1 sibling, 2 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 17:31 UTC (permalink / raw) To: Ingo Molnar Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, 7 May 2008, Ingo Molnar wrote: > > There's far more normal mutex fastpath use during an AIM7 run than any > BKL use. So if it's due to any direct fastpath overhead and the > resulting widening of the window for the real slowdown, we should see a > severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed? Not agreed. The BKL is special because it is a *single* lock. All the "normal" mutex code uses fine-grained locking, so even if you slow down the fast path, that won't cause the same kind of fastpath->slowpath increase. In order to see the fastpath->slowpath thing, you do need to have many threads hitting the same lock: ie the slowdown has to result in real contention. Almost no mutexes have any potential for contention what-so-ever, except for things that very consciously try to hit it (multiple threads doing readdir and/or file creation on the *same* directory etc). The BKL really is special. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 17:31 ` Linus Torvalds @ 2008-05-07 17:47 ` Linus Torvalds 2008-05-07 17:49 ` Ingo Molnar 1 sibling, 0 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 17:47 UTC (permalink / raw) To: Ingo Molnar Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, 7 May 2008, Linus Torvalds wrote: > > All the "normal" mutex code uses fine-grained locking, so even if you slow > down the fast path, that won't cause the same kind of fastpath->slowpath > increase. Put another way: let's say that the "good fastpath" is basically a single locked instruction - ~12 cycles on AMD, ~35 on Core 2. That's the no-bouncing, no-contention case. Doing it with debugging (call overhead, spinlocks, local irq saving etc) will probably easily triple it or more, but we're not changing anything else. There's no "downstream" effect: the behaviour itself doesn't change. It doesn't get more bouncing, it doesn't start sleeping. But what happens if the lock has the *potential* for conflicts is different. There, a "longish pause + fast lock + short average code sequence + fast unlock" is quite likely to stay uncontended for a fair amount of time, and while it will be much slower than the no-contention-at-all case (because you do get a pretty likely cacheline event at the "fast lock" part), with a fairly low number of CPU's and a long enough pause, you *also* easily get into a pattern where the thing that got the lock will likely also get to unlock without dropping the cacheline. So far so good. But that basically depends on the fact that "lock + work + unlock" is _much_ shorter than the "longish pause" in between, so that even if you have <n> CPU's all doing the same thing, their pauses between the locked section are still bigger than <n> times that short time. 
Once that is no longer true, you now start to bounce both at the lock *and* the unlock, and now that whole sequence got likely ten times slower. *AND* because it now actually has real contention, it actually got even worse: if the lock is a sleeping one, you get *another* order of magnitude just because you now started doing scheduling overhead too! So the thing is, it just breaks down very badly. A spinlock that gets contention probably gets ten times slower due to bouncing the cacheline. A semaphore that gets contention probably gets a *hundred* times slower, or more. And so my bet is that both the old and the new semaphores had the same bad break-down situation, but the new semaphores just are a lot easier to trigger it because they are at least three times costlier than the old ones, so you just hit the bad behaviour with much lower loads (or fewer number of CPU's). But spinlocks really do behave much better when contended, because at least they don't get the even bigger hit of also hitting the scheduler. So the old semaphores would have behaved badly too *eventually*, they just needed a more extreme case to show that bad behavior. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 17:31 ` Linus Torvalds 2008-05-07 17:47 ` Linus Torvalds @ 2008-05-07 17:49 ` Ingo Molnar 2008-05-07 18:02 ` Linus Torvalds 1 sibling, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-07 17:49 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > There's far more normal mutex fastpath use during an AIM7 run than > > any BKL use. So if it's due to any direct fastpath overhead and the > > resulting widening of the window for the real slowdown, we should > > see a severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed? > > Not agreed. > > The BKL is special because it is a *single* lock. ok, indeed my suggestion is wrong and this would not be a good comparison. another idea: my trial-balloon patch should test your theory too, because the generic down_trylock() is still the 'fat' version, it does: spin_lock_irqsave(&sem->lock, flags); count = sem->count - 1; if (likely(count >= 0)) sem->count = count; spin_unlock_irqrestore(&sem->lock, flags); if there is a noticeable performance difference between your trial-balloon patch and mine, then the micro-cost of the BKL very much matters to this workload. Agreed about that? but i'd be _hugely_ surprised about it. The tty code's BKL use should i think only happen when a task exits and releases the tty - and a task exit - even if this is a threaded test (which AIM7 can be - not sure which exact parameters Yanmin used) - the costs of thread creation and thread exit are just not in the same ballpark as any BKL micro-costs. Dunno, maybe i overlooked some high-freq BKL user. (but any such site would have shown up before) Even assuming a widening of the critical path and some catastrophic domino effect (that does show up as increased scheduling) i've never seen a 40% drop like this. 
this regression, to me, has "different scheduling behavior" written all over it - but that's just an impression. I'm not going to bet against you though ;-) Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 17:49 ` Ingo Molnar @ 2008-05-07 18:02 ` Linus Torvalds 2008-05-07 18:17 ` Ingo Molnar 0 siblings, 1 reply; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 18:02 UTC (permalink / raw) To: Ingo Molnar Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, 7 May 2008, Ingo Molnar wrote: > > another idea: my trial-baloon patch should test your theory too, because > the generic down_trylock() is still the 'fat' version, it does: I agree that your trial-balloon should likely get rid of the big regression, since it avoids the scheduler. So with your patch, lock_kernel() ends up being just a rather expensive spinlock. And yes, I'd expect that it should get rid of the 40% cost, because while it makes lock_kernel() more expensive than a spinlock and you might end up having a few more cacheline bounces on the lock due to that, that's still the "small" expense compared to going through the whole scheduler on conflicts. So I'd expect that realistically the performance difference between your version and just plain spinlocks shouldn't be *that* big. I'd expect it to be visible, but in the (low) single-digit percentage range rather than in any 40% range. That's just a guess. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 18:02 ` Linus Torvalds @ 2008-05-07 18:17 ` Ingo Molnar 2008-05-07 18:27 ` Linus Torvalds 0 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-07 18:17 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, 7 May 2008, Ingo Molnar wrote: > > > > another idea: my trial-balloon patch should test your theory too, > > because the generic down_trylock() is still the 'fat' version, it > > does: > > I agree that your trial-balloon should likely get rid of the big > regression, since it avoids the scheduler. > > So with your patch, lock_kernel() ends up being just a rather > expensive spinlock. And yes, I'd expect that it should get rid of the > 40% cost, because while it makes lock_kernel() more expensive than a > spinlock and you might end up having a few more cacheline bounces on > the lock due to that, that's still the "small" expense compared to > going through the whole scheduler on conflicts. > > So I'd expect that realistically the performance difference between > your version and just plain spinlocks shouldn't be *that* big. I'd > expect it to be visible, but in the (low) single-digit percentage > range rather than in any 40% range. That's just a guess. third attempt - the patch below on top of v2.6.25 should have fastpath atomic overhead quite similar to what the generic semaphores do? So if Yanmin tests this patch on top of v2.6.25, we should see the direct fastpath overhead - without any changes to the semaphore wakeup/scheduling logic otherwise. [ this patch should in fact be a bit worse, because there's two more atomics in the fastpath - the fastpath atomics of the old semaphore code. 
] Ingo ------------------> Subject: v2.6.25 BKL: add atomic overhead From: Ingo Molnar <mingo@elte.hu> Date: Wed May 07 20:09:13 CEST 2008 --- lib/kernel_lock.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) Index: linux-2.6.25/lib/kernel_lock.c =================================================================== --- linux-2.6.25.orig/lib/kernel_lock.c +++ linux-2.6.25/lib/kernel_lock.c @@ -24,6 +24,7 @@ * Don't use in new code. */ static DECLARE_MUTEX(kernel_sem); +static DEFINE_SPINLOCK(global_lock); /* * Re-acquire the kernel semaphore. @@ -47,6 +48,9 @@ int __lockfunc __reacquire_kernel_lock(v down(&kernel_sem); + spin_lock(&global_lock); + spin_unlock(&global_lock); + preempt_disable(); task->lock_depth = saved_lock_depth; @@ -55,6 +59,9 @@ int __lockfunc __reacquire_kernel_lock(v void __lockfunc __release_kernel_lock(void) { + spin_lock(&global_lock); + spin_unlock(&global_lock); + up(&kernel_sem); } @@ -66,12 +73,16 @@ void __lockfunc lock_kernel(void) struct task_struct *task = current; int depth = task->lock_depth + 1; - if (likely(!depth)) + if (likely(!depth)) { /* * No recursion worries - we set up lock_depth _after_ */ down(&kernel_sem); + spin_lock(&global_lock); + spin_unlock(&global_lock); + } + task->lock_depth = depth; } ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 18:17 ` Ingo Molnar @ 2008-05-07 18:27 ` Linus Torvalds 2008-05-07 18:43 ` Ingo Molnar 0 siblings, 1 reply; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 18:27 UTC (permalink / raw) To: Ingo Molnar Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, 7 May 2008, Ingo Molnar wrote: > > [ this patch should in fact be a bit worse, because there's two more > atomics in the fastpath - the fastpath atomics of the old semaphore > code. ] Well, it doesn't have the irq stuff, which is also pretty costly. Also, it doesn't nest the accesses the same way (with the counts being *inside* the spinlock and serialized against each other), so I'm not 100% sure you'd get the same behaviour. But yes, it certainly has the potential to show the same slowdown. But it's not a very good patch, since not showing it doesn't really prove much. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 18:27 ` Linus Torvalds @ 2008-05-07 18:43 ` Ingo Molnar 2008-05-07 19:01 ` Linus Torvalds 0 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-07 18:43 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > [ this patch should in fact be a bit worse, because there's two more > > atomics in the fastpath - the fastpath atomics of the old > > semaphore code. ] > > Well, it doesn't have the irq stuff, which is also pretty costly. > Also, it doesn't nest the accesses the same way (with the counts being > *inside* the spinlock and serialized against each other), so I'm not > 100% sure you'd get the same behaviour. > > But yes, it certainly has the potential to show the same slowdown. But > it's not a very good patch, since not showing it doesn't really prove > much. ok, the one below does irq ops and the counter behavior - and because the critical section also has the old-semaphore atomics i think this should definitely be a more expensive fastpath than what the new generic code introduces. So if this patch produces a 40% AIM7 slowdown on v2.6.25 it's the fastpath overhead (and its effects on slowpath probability) that makes the difference. Ingo -------------------> Subject: add BKL atomic overhead From: Ingo Molnar <mingo@elte.hu> Date: Wed May 07 20:09:13 CEST 2008 NOT-Signed-off-by: Ingo Molnar <mingo@elte.hu> --- lib/kernel_lock.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) Index: linux-2.6.25/lib/kernel_lock.c =================================================================== --- linux-2.6.25.orig/lib/kernel_lock.c +++ linux-2.6.25/lib/kernel_lock.c @@ -24,6 +24,8 @@ * Don't use in new code. 
*/ static DECLARE_MUTEX(kernel_sem); +static int global_count; +static DEFINE_SPINLOCK(global_lock); /* * Re-acquire the kernel semaphore. @@ -39,6 +41,7 @@ int __lockfunc __reacquire_kernel_lock(v { struct task_struct *task = current; int saved_lock_depth = task->lock_depth; + unsigned long flags; BUG_ON(saved_lock_depth < 0); @@ -47,6 +50,10 @@ int __lockfunc __reacquire_kernel_lock(v down(&kernel_sem); + spin_lock_irqsave(&global_lock, flags); + global_count++; + spin_unlock_irqrestore(&global_lock, flags); + preempt_disable(); task->lock_depth = saved_lock_depth; @@ -55,6 +62,11 @@ int __lockfunc __reacquire_kernel_lock(v void __lockfunc __release_kernel_lock(void) { + unsigned long flags; + spin_lock_irqsave(&global_lock, flags); + global_count--; + spin_unlock_irqrestore(&global_lock, flags); + up(&kernel_sem); } @@ -66,12 +78,18 @@ void __lockfunc lock_kernel(void) { struct task_struct *task = current; int depth = task->lock_depth + 1; + unsigned long flags; - if (likely(!depth)) + if (likely(!depth)) { /* * No recursion worries - we set up lock_depth _after_ */ down(&kernel_sem); + spin_lock_irqsave(&global_lock, flags); + global_count++; + spin_unlock_irqrestore(&global_lock, flags); + } + task->lock_depth = depth; } ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 18:43 ` Ingo Molnar @ 2008-05-07 19:01 ` Linus Torvalds 2008-05-07 19:09 ` Ingo Molnar 2008-05-07 19:24 ` Matthew Wilcox 0 siblings, 2 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 19:01 UTC (permalink / raw) To: Ingo Molnar Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, 7 May 2008, Ingo Molnar wrote: > > ok, the one below does irq ops and the counter behavior No it doesn't. The counter isn't used for any actual *testing*, so the locking around it and the serialization of it has absolutely no impact on the scheduling behaviour! Since the big slowdown was clearly accompanied by sleeping behaviour (the processes who didn't get the lock end up sleeping!), that is a *big* part of the slowdown. Is it possible that your patch gets similar behaviour? Absolutely. But you're missing the whole point here. Anybody can make code behave badly and perform worse. But if you want to just verify that it's about the sleeping behaviour and timings of the BKL, then you need to do exactly that: emulate the sleeping behavior, not just the timings _outside_ of the sleeping behavior. The thing is, we definitely are interested to see whether it's the BKL or some other semaphore that is the problem. But the best way to test that is to just try my patch that *guarantees* that the BKL doesn't have any semaphore behaviour AT ALL. Could it be something else entirely? Yes. We know it's semaphore-related. We don't know for a fact that it's the BKL itself. There could be other semaphores that are that hot. It sounds unlikely, but quite frankly, regardless, I don't really see the point of your patches. If Yanmin tries my patch, it is *guaranteed* to show something. 
It either shows that it's about the BKL (and that we absolutely have to do the BKL as something _else_ than a generic semaphore), or it shows that it's not about the BKL (and that _all_ the patches in this discussion are likely pointless). In contrast, these "try to emulate bad behavior with the old known-ok semaphores" don't show anything AT ALL. We already know it's related to semaphores. And your patches aren't even guaranteed to show the same issue. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 19:01 ` Linus Torvalds @ 2008-05-07 19:09 ` Ingo Molnar 2008-05-07 19:24 ` Matthew Wilcox 1 sibling, 0 replies; 140+ messages in thread From: Ingo Molnar @ 2008-05-07 19:09 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > In contrast, these "try to emulate bad behavior with the old known-ok > semaphores" don't show anything AT ALL. We already know it's related > to semaphores. And your patches aren't even guaranteed to show the > same issue. yeah, i was just trying to come up with patches to probe which one of the following two possibilities is actually the case: - if the regression is due to the difference in scheduling behavior of new semaphores (different wakeup patterns, etc.), that's fixable in the new semaphore code => then the BKL code need not change. - if the regression is due to the difference in the fastpath cost, then the new semaphores can probably not be improved (much of their appeal comes from them not being complex and not being in assembly) => then the BKL code needs to change to become cheaper [i.e. then we want your patch]. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 19:01 ` Linus Torvalds 2008-05-07 19:09 ` Ingo Molnar @ 2008-05-07 19:24 ` Matthew Wilcox 2008-05-07 19:44 ` Linus Torvalds 1 sibling, 1 reply; 140+ messages in thread From: Matthew Wilcox @ 2008-05-07 19:24 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, May 07, 2008 at 12:01:28PM -0700, Linus Torvalds wrote: > The thing is, we definitely are interested to see whether it's the BKL or > some other semaphore that is the problem. But the best way to test that is > to just try my patch that *guarantees* that the BKL doesn't have any > semaphore behaviour AT ALL. > > Could it be something else entirely? Yes. We know it's semaphore- related. > We don't know for a fact that it's the BKL itself. There could be other > semaphores that are that hot. It sounds unlikely, but quite frankly, > regardless, I don't really see the point of your patches. > > If Yanmin tries my patch, it is *guaranteed* to show something. It either > shows that it's about the BKL (and that we absolutely have to do the BKL > as something _else_ than a generic semaphore), or it shows that it's not > about the BKL (and that _all_ the patches in this discussion are likely > pointless). One patch I'd still like Yanmin to test is my one from yesterday which removes the BKL from fs/locks.c. http://marc.info/?l=linux-fsdevel&m=121009123427437&w=2 Obviously, it won't help if the problem isn't the BKL. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 19:24 ` Matthew Wilcox @ 2008-05-07 19:44 ` Linus Torvalds 2008-05-07 20:00 ` Oi. NFS people. Read this Matthew Wilcox 0 siblings, 1 reply; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 19:44 UTC (permalink / raw) To: Matthew Wilcox Cc: Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, 7 May 2008, Matthew Wilcox wrote: > > One patch I'd still like Yanmin to test is my one from yesterday which > removes the BKL from fs/locks.c. And I'd personally rather have the network-fs people test and comment on that one ;) I think that patch is worth looking at regardless, but the problems with that one aren't about performance, but about what the implications are for the filesystems (if any)... Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Oi. NFS people. Read this. 2008-05-07 19:44 ` Linus Torvalds @ 2008-05-07 20:00 ` Matthew Wilcox 2008-05-07 22:10 ` Trond Myklebust 0 siblings, 1 reply; 140+ messages in thread From: Matthew Wilcox @ 2008-05-07 20:00 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote: > On Wed, 7 May 2008, Matthew Wilcox wrote: > > > > One patch I'd still like Yanmin to test is my one from yesterday which > > removes the BKL from fs/locks.c. > > And I'd personally rather have the network-fs people test and comment on > that one ;) > > I think that patch is worth looking at regardless, but the problems with > that one aren't about performance, but about what the implications are for > the filesystems (if any)... Oh, well, they don't seem interested. I can comment on some of the problems though. fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c, fs/nfsd/nfs4state.c all walk the i_flock list under the BKL. That won't protect them against locks.c any more. That's probably OK for fs/nfs/* since they'll be protected by their own data structures (Someone please check me on that?), but it's a bad idea for lockd/nfsd which are walking the lists for filesystems. Are we going to have to export the file_lock_lock? I'd rather not. But we need to keep nfsd/lockd from tripping over locks.c. Maybe we could come up with a decent API that lockd could use? It all seems a bit complex at the moment ... maybe lockd should be keeping track of the locks it owns anyway (since surely the posix deadlock detection code can't work properly if it's just passing all the locks through). -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." 
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: Oi. NFS people. Read this. 2008-05-07 20:00 ` Oi. NFS people. Read this Matthew Wilcox @ 2008-05-07 22:10 ` Trond Myklebust 2008-05-09 1:43 ` J. Bruce Fields 0 siblings, 1 reply; 140+ messages in thread From: Trond Myklebust @ 2008-05-07 22:10 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, 2008-05-07 at 14:00 -0600, Matthew Wilcox wrote: > On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote: > > On Wed, 7 May 2008, Matthew Wilcox wrote: > > > > > > One patch I'd still like Yanmin to test is my one from yesterday which > > > removes the BKL from fs/locks.c. > > > > And I'd personally rather have the network-fs people test and comment on > > that one ;) > > > > I think that patch is worth looking at regardless, but the problems with > > that one aren't about performance, but about what the implications are for > > the filesystems (if any)... > > Oh, well, they don't seem interested. Poor timing: we're all preparing for and travelling to the annual Connectathon interoperability testing conference which starts tomorrow. > I can comment on some of the problems though. > > fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c, > fs/nfsd/nfs4state.c all walk the i_flock list under the BKL. That won't > protect them against locks.c any more. That's probably OK for fs/nfs/* > since they'll be protected by their own data structures (Someone please > check me on that?), but it's a bad idea for lockd/nfsd which are walking > the lists for filesystems. Yes. fs/nfs is just reusing the code in fs/locks.c in order to track the locks it holds on the server. We could alternatively have coded a private lock implementation, but this seemed easier. > Are we going to have to export the file_lock_lock? I'd rather not. But > we need to keep nfsd/lockd from tripping over locks.c. > > Maybe we could come up with a decent API that lockd could use? 
It all > seems a bit complex at the moment ... maybe lockd should be keeping > track of the locks it owns anyway (since surely the posix deadlock > detection code can't work properly if it's just passing all the locks > through). I'm not sure what you mean when you talk about lockd keeping track of the locks it owns. It has to keep those locks on inode->i_flock in order to make them visible to the host filesystem... All lockd really needs, is the ability to find a lock it owns, and then obtain a copy. As for the nfs client, I suspect we can make do with something similar... Cheers Trond ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: Oi. NFS people. Read this. 2008-05-07 22:10 ` Trond Myklebust @ 2008-05-09 1:43 ` J. Bruce Fields 0 siblings, 0 replies; 140+ messages in thread From: J. Bruce Fields @ 2008-05-09 1:43 UTC (permalink / raw) To: Trond Myklebust Cc: Matthew Wilcox, Linus Torvalds, Ingo Molnar, Andrew Morton, Zhang, Yanmin, LKML, Alexander Viro, linux-fsdevel On Wed, May 07, 2008 at 03:10:27PM -0700, Trond Myklebust wrote: > On Wed, 2008-05-07 at 14:00 -0600, Matthew Wilcox wrote: > > On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote: > > > On Wed, 7 May 2008, Matthew Wilcox wrote: > > > > > > > > One patch I'd still like Yanmin to test is my one from yesterday which > > > > removes the BKL from fs/locks.c. > > > > > > And I'd personally rather have the network-fs people test and comment on > > > that one ;) > > > > > > I think that patch is worth looking at regardless, but the problems with > > > that one aren't about performance, but about what the implications are for > > > the filesystems (if any)... > > > > Oh, well, they don't seem interested. > > Poor timing: we're all preparing for and travelling to the annual > Connectathon interoperability testing conference which starts tomorrow. > > > I can comment on some of the problems though. > > > > fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c, > > fs/nfsd/nfs4state.c all walk the i_flock list under the BKL. That won't > > protect them against locks.c any more. That's probably OK for fs/nfs/* > > since they'll be protected by their own data structures (Someone please > > check me on that?), but it's a bad idea for lockd/nfsd which are walking > > the lists for filesystems. > > Yes. fs/nfs is just reusing the code in fs/locks.c in order to track the > locks it holds on the server. We could alternatively have coded a > private lock implementation, but this seemed easier. 
So, assuming nfs is taking care of its own locking (I don't know if that's right), that leaves nlm_traverse_locks() and nlm_file_inuse() (both in fs/lockd/svcsubs.c) as the problem spots. > > Are we going to have to export the file_lock_lock? I'd rather not. But > > we need to keep nfsd/lockd from tripping over locks.c. > > > > Maybe we could come up with a decent API that lockd could use? It all > > seems a bit complex at the moment ... maybe lockd should be keeping > > track of the locks it owns anyway (since surely the posix deadlock > > detection code can't work properly if it's just passing all the locks > > through). > > I'm not sure what you mean when you talk about lockd keeping track of > the locks it owns. It has to keep those locks on inode->i_flock in order > to make them visible to the host filesystem... > > All lockd really needs, is the ability to find a lock it owns, and then > obtain a copy. That sounds right. --b. > As for the nfs client, I suspect we can make do with > something similar... ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 16:23 ` Matthew Wilcox 2008-05-06 16:36 ` Linus Torvalds 2008-05-06 17:21 ` Andrew Morton @ 2008-05-08 3:24 ` Zhang, Yanmin 2008-05-08 3:34 ` Linus Torvalds 2 siblings, 1 reply; 140+ messages in thread From: Zhang, Yanmin @ 2008-05-08 3:24 UTC (permalink / raw) To: Matthew Wilcox Cc: Ingo Molnar, J. Bruce Fields, LKML, Alexander Viro, Linus Torvalds, Andrew Morton, linux-fsdevel On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote: > On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote: > > So the only likely things I can see are: > > > > - file locks > > - fasync > > I've wanted to fix file locks for a while. Here's a first attempt. > It was done quickly, so I concede that it may well have bugs in it. > I found (and fixed) one with LTP. > > It takes *no account* of nfsd, nor remote filesystems. We need to have > a serious discussion about their requirements. I tested it on 8-core stoakley. aim7 result becomes 23% worse than the one of pure 2.6.26-rc1. I replied this email in case you have many patches and I might test what you don't expect. -yanmin > > diff --git a/fs/locks.c b/fs/locks.c > index 663c069..cb09765 100644 > --- a/fs/locks.c > +++ b/fs/locks.c > @@ -140,6 +140,8 @@ int lease_break_time = 45; > #define for_each_lock(inode, lockp) \ > for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next) > > +static DEFINE_SPINLOCK(file_lock_lock); > + > static LIST_HEAD(file_lock_list); > static LIST_HEAD(blocked_list); > > @@ -510,9 +512,9 @@ static void __locks_delete_block(struct file_lock *waiter) > */ > static void locks_delete_block(struct file_lock *waiter) > { > - lock_kernel(); > + spin_lock(&file_lock_lock); > __locks_delete_block(waiter); > - unlock_kernel(); > + spin_unlock(&file_lock_lock); > } > > /* Insert waiter into blocker's block list. 
> @@ -649,7 +651,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl) > { > struct file_lock *cfl; > > - lock_kernel(); > + spin_lock(&file_lock_lock); > for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) { > if (!IS_POSIX(cfl)) > continue; > @@ -662,7 +664,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl) > fl->fl_pid = pid_vnr(cfl->fl_nspid); > } else > fl->fl_type = F_UNLCK; > - unlock_kernel(); > + spin_unlock(&file_lock_lock); > return; > } > EXPORT_SYMBOL(posix_test_lock); > @@ -735,18 +737,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) > int error = 0; > int found = 0; > > - lock_kernel(); > - if (request->fl_flags & FL_ACCESS) > + if (request->fl_flags & FL_ACCESS) { > + spin_lock(&file_lock_lock); > goto find_conflict; > + } > > if (request->fl_type != F_UNLCK) { > error = -ENOMEM; > + > new_fl = locks_alloc_lock(); > if (new_fl == NULL) > - goto out; > + goto out_unlocked; > error = 0; > } > > + spin_lock(&file_lock_lock); > for_each_lock(inode, before) { > struct file_lock *fl = *before; > if (IS_POSIX(fl)) > @@ -772,10 +777,13 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) > * If a higher-priority process was blocked on the old file lock, > * give it the opportunity to lock the file. 
> */ > - if (found) > + if (found) { > + spin_unlock(&file_lock_lock); > cond_resched(); > + spin_lock(&file_lock_lock); > + } > > -find_conflict: > + find_conflict: > for_each_lock(inode, before) { > struct file_lock *fl = *before; > if (IS_POSIX(fl)) > @@ -796,8 +804,9 @@ find_conflict: > new_fl = NULL; > error = 0; > > -out: > - unlock_kernel(); > + out: > + spin_unlock(&file_lock_lock); > + out_unlocked: > if (new_fl) > locks_free_lock(new_fl); > return error; > @@ -826,7 +835,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str > new_fl2 = locks_alloc_lock(); > } > > - lock_kernel(); > + spin_lock(&file_lock_lock); > if (request->fl_type != F_UNLCK) { > for_each_lock(inode, before) { > fl = *before; > @@ -994,7 +1003,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str > locks_wake_up_blocks(left); > } > out: > - unlock_kernel(); > + spin_unlock(&file_lock_lock); > /* > * Free any unused locks. > */ > @@ -1069,14 +1078,14 @@ int locks_mandatory_locked(struct inode *inode) > /* > * Search the lock list for this inode for any POSIX locks. > */ > - lock_kernel(); > + spin_lock(&file_lock_lock); > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) { > if (!IS_POSIX(fl)) > continue; > if (fl->fl_owner != owner) > break; > } > - unlock_kernel(); > + spin_unlock(&file_lock_lock); > return fl ? -EAGAIN : 0; > } > > @@ -1190,7 +1199,7 @@ int __break_lease(struct inode *inode, unsigned int mode) > > new_fl = lease_alloc(NULL, mode & FMODE_WRITE ? 
F_WRLCK : F_RDLCK); > > - lock_kernel(); > + spin_lock(&file_lock_lock); > > time_out_leases(inode); > > @@ -1251,8 +1260,10 @@ restart: > break_time++; > } > locks_insert_block(flock, new_fl); > + spin_unlock(&file_lock_lock); > error = wait_event_interruptible_timeout(new_fl->fl_wait, > !new_fl->fl_next, break_time); > + spin_lock(&file_lock_lock); > __locks_delete_block(new_fl); > if (error >= 0) { > if (error == 0) > @@ -1266,8 +1277,8 @@ restart: > error = 0; > } > > -out: > - unlock_kernel(); > + out: > + spin_unlock(&file_lock_lock); > if (!IS_ERR(new_fl)) > locks_free_lock(new_fl); > return error; > @@ -1323,7 +1334,7 @@ int fcntl_getlease(struct file *filp) > struct file_lock *fl; > int type = F_UNLCK; > > - lock_kernel(); > + spin_lock(&file_lock_lock); > time_out_leases(filp->f_path.dentry->d_inode); > for (fl = filp->f_path.dentry->d_inode->i_flock; fl && IS_LEASE(fl); > fl = fl->fl_next) { > @@ -1332,7 +1343,7 @@ int fcntl_getlease(struct file *filp) > break; > } > } > - unlock_kernel(); > + spin_unlock(&file_lock_lock); > return type; > } > > @@ -1363,6 +1374,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp) > if (error) > return error; > > + spin_lock(&file_lock_lock); > time_out_leases(inode); > > BUG_ON(!(*flp)->fl_lmops->fl_break); > @@ -1370,10 +1382,11 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp) > lease = *flp; > > if (arg != F_UNLCK) { > + spin_unlock(&file_lock_lock); > error = -ENOMEM; > new_fl = locks_alloc_lock(); > if (new_fl == NULL) > - goto out; > + goto out_unlocked; > > error = -EAGAIN; > if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0)) > @@ -1382,6 +1395,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp) > && ((atomic_read(&dentry->d_count) > 1) > || (atomic_read(&inode->i_count) > 1))) > goto out; > + spin_lock(&file_lock_lock); > } > > /* > @@ -1429,11 +1443,14 @@ int generic_setlease(struct file *filp, long arg, struct 
file_lock **flp) > > locks_copy_lock(new_fl, lease); > locks_insert_lock(before, new_fl); > + spin_unlock(&file_lock_lock); > > *flp = new_fl; > return 0; > > -out: > + out: > + spin_unlock(&file_lock_lock); > + out_unlocked: > if (new_fl != NULL) > locks_free_lock(new_fl); > return error; > @@ -1471,12 +1488,10 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease) > { > int error; > > - lock_kernel(); > if (filp->f_op && filp->f_op->setlease) > error = filp->f_op->setlease(filp, arg, lease); > else > error = generic_setlease(filp, arg, lease); > - unlock_kernel(); > > return error; > } > @@ -1503,12 +1518,11 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg) > if (error) > return error; > > - lock_kernel(); > - > error = vfs_setlease(filp, arg, &flp); > if (error || arg == F_UNLCK) > - goto out_unlock; > + return error; > > + lock_kernel(); > error = fasync_helper(fd, filp, 1, &flp->fl_fasync); > if (error < 0) { > /* remove lease just inserted by setlease */ > @@ -1519,7 +1533,7 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg) > } > > error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0); > -out_unlock: > + out_unlock: > unlock_kernel(); > return error; > } > @@ -2024,7 +2038,7 @@ void locks_remove_flock(struct file *filp) > fl.fl_ops->fl_release_private(&fl); > } > > - lock_kernel(); > + spin_lock(&file_lock_lock); > before = &inode->i_flock; > > while ((fl = *before) != NULL) { > @@ -2042,7 +2056,7 @@ void locks_remove_flock(struct file *filp) > } > before = &fl->fl_next; > } > - unlock_kernel(); > + spin_unlock(&file_lock_lock); > } > > /** > @@ -2057,12 +2071,12 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter) > { > int status = 0; > > - lock_kernel(); > + spin_lock(&file_lock_lock); > if (waiter->fl_next) > __locks_delete_block(waiter); > else > status = -ENOENT; > - unlock_kernel(); > + spin_unlock(&file_lock_lock); > return status; > } > > @@ -2175,7 +2189,7 @@ static int 
locks_show(struct seq_file *f, void *v) > > static void *locks_start(struct seq_file *f, loff_t *pos) > { > - lock_kernel(); > + spin_lock(&file_lock_lock); > f->private = (void *)1; > return seq_list_start(&file_lock_list, *pos); > } > @@ -2187,7 +2201,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos) > > static void locks_stop(struct seq_file *f, void *v) > { > - unlock_kernel(); > + spin_unlock(&file_lock_lock); > } > > struct seq_operations locks_seq_operations = { > @@ -2215,7 +2229,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len) > { > struct file_lock *fl; > int result = 1; > - lock_kernel(); > + spin_lock(&file_lock_lock); > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) { > if (IS_POSIX(fl)) { > if (fl->fl_type == F_RDLCK) > @@ -2232,7 +2246,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len) > result = 0; > break; > } > - unlock_kernel(); > + spin_unlock(&file_lock_lock); > return result; > } > > @@ -2255,7 +2269,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len) > { > struct file_lock *fl; > int result = 1; > - lock_kernel(); > + spin_lock(&file_lock_lock); > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) { > if (IS_POSIX(fl)) { > if ((fl->fl_end < start) || (fl->fl_start > (start + len))) > @@ -2270,7 +2284,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len) > result = 0; > break; > } > - unlock_kernel(); > + spin_unlock(&file_lock_lock); > return result; > } > > ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-08 3:24 ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin @ 2008-05-08 3:34 ` Linus Torvalds 2008-05-08 4:37 ` Zhang, Yanmin 0 siblings, 1 reply; 140+ messages in thread From: Linus Torvalds @ 2008-05-08 3:34 UTC (permalink / raw) To: Zhang, Yanmin Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML, Alexander Viro, Andrew Morton, linux-fsdevel On Thu, 8 May 2008, Zhang, Yanmin wrote: > > On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote: > > On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote: > > > So the only likely things I can see are: > > > > > > - file locks > > > - fasync > > > > I've wanted to fix file locks for a while. Here's a first attempt. > > It was done quickly, so I concede that it may well have bugs in it. > > I found (and fixed) one with LTP. > > > > It takes *no account* of nfsd, nor remote filesystems. We need to have > > a serious discussion about their requirements. > > I tested it on 8-core stoakley. aim7 result becomes 23% worse than the one of > pure 2.6.26-rc1. Ouch. That's really odd. The BKL->spinlock conversion looks really obvious, so it shouldn't be that noticeably slower. The *one* difference is that the BKL has the whole "you can take it recursively and you can sleep without dropping it because the scheduler will drop it for you" thing. The spinlock conversion changed all of that into explicit "drop and retake" locks, and maybe that causes some issues. But 23% worse? That sounds really odd/extreme. Can you do a oprofile run or something? Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-08 3:34 ` Linus Torvalds @ 2008-05-08 4:37 ` Zhang, Yanmin 2008-05-08 14:58 ` Linus Torvalds 0 siblings, 1 reply; 140+ messages in thread From: Zhang, Yanmin @ 2008-05-08 4:37 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML, Alexander Viro, Andrew Morton, linux-fsdevel On Wed, 2008-05-07 at 20:34 -0700, Linus Torvalds wrote: > > On Thu, 8 May 2008, Zhang, Yanmin wrote: > > > > On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote: > > > On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote: > > > > So the only likely things I can see are: > > > > > > > > - file locks > > > > - fasync > > > > > > I've wanted to fix file locks for a while. Here's a first attempt. > > > It was done quickly, so I concede that it may well have bugs in it. > > > I found (and fixed) one with LTP. > > > > > > It takes *no account* of nfsd, nor remote filesystems. We need to have > > > a serious discussion about their requirements. > > > > I tested it on 8-core stoakley. aim7 result becomes 23% worse than the one of > > pure 2.6.26-rc1. > > Ouch. That's really odd. The BKL->spinlock conversion looks really > obvious, so it shouldn't be that noticeably slower. > > The *one* difference is that the BKL has the whole "you can take it > recursively and you can sleep without dropping it because the scheduler > will drop it for you" thing. The spinlock conversion changed all of that > into explicit "drop and retake" locks, and maybe that causes some issues. > > But 23% worse? That sounds really odd/extreme. > > Can you do a oprofile run or something? I collected oprofile data. It looks not useful, as cpu idle is more than 50%. 
samples  %        app name        symbol name
270157   9.4450   multitask       add_long
266419   9.3143   multitask       add_int
238934   8.3534   multitask       add_double
187184   6.5442   multitask       mul_double
159448   5.5745   multitask       add_float
156312   5.4649   multitask       sieve
148081   5.1771   multitask       mul_float
127192   4.4468   multitask       add_short
80480    2.8137   multitask       string_rtns_1
57520    2.0110   vmlinux         clear_page_c
53935    1.8856   multitask       div_long
48753    1.7045   libc-2.6.90.so  strncat
40825    1.4273   multitask       array_rtns
32807    1.1470   vmlinux         __copy_user_nocache
31995    1.1186   multitask       div_int
31143    1.0888   multitask       div_float
28821    1.0076   multitask       div_double
26400    0.9230   vmlinux         find_lock_page
26159    0.9146   vmlinux         unmap_vmas
25249    0.8827   multitask       div_short
21509    0.7520   vmlinux         native_read_tsc
18865    0.6595   vmlinux         copy_user_generic_string
17993    0.6291   vmlinux         copy_page_c
16367    0.5722   vmlinux         system_call
14616    0.5110   libc-2.6.90.so  msort_with_tmp
13630    0.4765   vmlinux         native_sched_clock
12952    0.4528   vmlinux         copy_page_range
12817    0.4481   libc-2.6.90.so  strcat
12708    0.4443   vmlinux         calc_delta_mine
12611    0.4409   libc-2.6.90.so  memset
11631    0.4066   bash            (no symbols)
9991     0.3493   vmlinux         update_curr
9328     0.3261   vmlinux         unlock_page
^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-08 4:37 ` Zhang, Yanmin @ 2008-05-08 14:58 ` Linus Torvalds 0 siblings, 0 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-08 14:58 UTC (permalink / raw) To: Zhang, Yanmin Cc: Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML, Alexander Viro, Andrew Morton, linux-fsdevel On Thu, 8 May 2008, Zhang, Yanmin wrote: > > I collected oprofile data. It looks not useful, as cpu idle is more than 50%. Ahh, so it's probably still the BKL that is the problem, it's just not in the file locking code. The changes to fs/locks.c probably didn't matter all that much, and the additional regression was likely just some perturbation. So it's probably fasync that AIM7 tests. Quite possibly coupled with /dev/tty etc. No file locking. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-06 11:44 ` Ingo Molnar 2008-05-06 12:09 ` Matthew Wilcox @ 2008-05-07 2:11 ` Zhang, Yanmin 2008-05-07 3:41 ` Zhang, Yanmin 1 sibling, 1 reply; 140+ messages in thread From: Zhang, Yanmin @ 2008-05-07 2:11 UTC (permalink / raw) To: Ingo Molnar Cc: Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton On Tue, 2008-05-06 at 13:44 +0200, Ingo Molnar wrote: > * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > > > Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more than 40% with > > 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton, and Itanium > > Montecito. Bisect located below patch. > > > > 64ac24e738823161693bf791f87adc802cf529ff is first bad commit > > commit 64ac24e738823161693bf791f87adc802cf529ff > > Author: Matthew Wilcox <matthew@wil.cx> > > Date: Fri Mar 7 21:55:58 2008 -0500 > > > > Generic semaphore implementation > > > > After I manually reverted the patch against 2.6.26-rc1 while fixing > > lots of conflictions/errors, aim7 regression became less than 2%. > > hm, which exact semaphore would that be due to? > > My first blind guess would be the BKL - there's not much other semaphore > use left in the core kernel otherwise that would affect AIM7 normally. > The VFS still makes frequent use of the BKL and AIM7 is very VFS > intense. Getting rid of that BKL use from the VFS might be useful to > performance anyway. > > Could you try to check that it's indeed the BKL? > > Easiest way to check it would be to run AIM7 it on > sched-devel.git/latest and do scheduler tracing via: > > http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt Thank you guys for the quick response. I ran into many regressions with 2.6.26-rc1, but just reported 2 of them because I located the patches. My machine is locating the root cause of 30% regression of sysbench+mysql(oltp readonly) now. Bisect is not so quick because either the kernel hangs during testing or compilation fails. 
Another regression, specjbb2005 on Montvale, is also under investigation. Let me figure out how to clone your tree quickly, as the network speed is very slow. One clearly weird behavior of aim7 is that cpu idle is 0% with 2.6.25, but is more than 50% with 2.6.26-rc1. I have a patch to collect schedule info. > > by doing: > > echo stacktrace > /debug/tracing/iter_ctl > > you could get exact backtraces of all scheduling points in the trace. If > the BKL's down() shows up in those traces then it's definitely the BKL > that causes this. The backtraces will also tell us exactly which BKL use > is the most frequent one. > > To keep tracing overhead low on SMP i'd also suggest to only trace a > single CPU, via: > > echo 1 > /debug/tracing/tracing_cpumask > > Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 2:11 ` Zhang, Yanmin @ 2008-05-07 3:41 ` Zhang, Yanmin 2008-05-07 3:59 ` Andrew Morton ` (2 more replies) 0 siblings, 3 replies; 140+ messages in thread From: Zhang, Yanmin @ 2008-05-07 3:41 UTC (permalink / raw) To: Ingo Molnar Cc: Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton On Wed, 2008-05-07 at 10:11 +0800, Zhang, Yanmin wrote: > On Tue, 2008-05-06 at 13:44 +0200, Ingo Molnar wrote: > > * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > > > > > > Comparing with kernel 2.6.25, AIM7 (use tmpfs) has a more than 40% regression with > > > 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton, and Itanium > > > Montecito. Bisect located the patch below. > > > > > > 64ac24e738823161693bf791f87adc802cf529ff is first bad commit > > > commit 64ac24e738823161693bf791f87adc802cf529ff > > > Author: Matthew Wilcox <matthew@wil.cx> > > > Date: Fri Mar 7 21:55:58 2008 -0500 > > > > > > Generic semaphore implementation > > > > > > After I manually reverted the patch against 2.6.26-rc1 while fixing > > > lots of conflicts/errors, the aim7 regression became less than 2%. > > > > hm, which exact semaphore would that be due to? > > > > My first blind guess would be the BKL - there's not much other semaphore > > use left in the core kernel otherwise that would affect AIM7 normally. > > The VFS still makes frequent use of the BKL and AIM7 is very VFS > > intense. Getting rid of that BKL use from the VFS might be useful to > > performance anyway. > > > > Could you try to check that it's indeed the BKL? > > > > Easiest way to check it would be to run AIM7 on > > sched-devel.git/latest and do scheduler tracing via: > > > > http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt > One clearly weird behavior of aim7 is that cpu idle is 0% with 2.6.25, but is more than 50% with > 2.6.26-rc1. I have a patch to collect schedule info. With my patch+gprof, I collected some data. Below is the output from gprof. 
index % time self children called name 0.00 0.00 2/223305376 __down_write_nested [22749] 0.00 0.00 3/223305376 journal_commit_transaction [10526] 0.00 0.00 6/223305376 __down_read [22745] 0.00 0.00 8/223305376 start_this_handle [19167] 0.00 0.00 15/223305376 sys_pause [19808] 0.00 0.00 17/223305376 log_wait_commit [11047] 0.00 0.00 20/223305376 futex_wait [8122] 0.00 0.00 64/223305376 pdflush [14335] 0.00 0.00 71/223305376 do_get_write_access [5367] 0.00 0.00 84/223305376 pipe_wait [14460] 0.00 0.00 111/223305376 kjournald [10726] 0.00 0.00 116/223305376 int_careful [9634] 0.00 0.00 224/223305376 do_nanosleep [5418] 0.00 0.00 1152/223305376 watchdog [22065] 0.00 0.00 4087/223305376 worker_thread [22076] 0.00 0.00 5003/223305376 __mutex_lock_killable_slowpath [23305] 0.00 0.00 7810/223305376 ksoftirqd [10831] 0.00 0.00 9389/223305376 __mutex_lock_slowpath [23306] 0.00 0.00 10642/223305376 io_schedule [9813] 0.00 0.00 23544/223305376 migration_thread [11495] 0.00 0.00 35319/223305376 __cond_resched [22673] 0.00 0.00 49065/223305376 retint_careful [16146] 0.00 0.00 119757/223305376 sysret_careful [20074] 0.00 0.00 151717/223305376 do_wait [5545] 0.00 0.00 250221/223305376 do_exit [5356] 0.00 0.00 303836/223305376 cpu_idle [4350] 0.00 0.00 222333093/223305376 schedule_timeout [2] [1] 0.0 0.00 0.00 223305376 schedule [1] ----------------------------------------------- 0.00 0.00 2/222333093 io_schedule_timeout [9814] 0.00 0.00 4/222333093 journal_stop [10588] 0.00 0.00 8/222333093 cifs_oplock_thread [3760] 0.00 0.00 14/222333093 do_sys_poll [5513] 0.00 0.00 20/222333093 cifs_dnotify_thread [3733] 0.00 0.00 32/222333093 read_chan [15648] 0.00 0.00 47/222333093 wait_for_common [22017] 0.00 0.00 658/222333093 do_select [5479] 0.00 0.00 2000/222333093 inet_stream_connect [9324] 0.00 0.00 222330308/222333093 __down [22577] [2] 0.0 0.00 0.00 222333093 schedule_timeout [2] 0.00 0.00 222333093/223305376 schedule [1] ----------------------------------------------- 0.00 0.00 
1/165565 flock_lock_file_wait [7735] 0.00 0.00 7/165565 __posix_lock_file [23371] 0.00 0.00 203/165565 de_put [4665] 0.00 0.00 243/165565 opost [13633] 0.00 0.00 333/165565 proc_root_readdir [14982] 0.00 0.00 358/165565 write_chan [22090] 0.00 0.00 6222/165565 proc_lookup_de [14908] 0.00 0.00 32081/165565 sys_fcntl [19687] 0.00 0.00 36045/165565 vfs_ioctl [21822] 0.00 0.00 42025/165565 tty_release [20818] 0.00 0.00 48047/165565 chrdev_open [3702] [3] 0.0 0.00 0.00 165565 lock_kernel [3] 0.00 0.00 152987/153190 down [4] ----------------------------------------------- 0.00 0.00 203/153190 __reacquire_kernel_lock [23420] 0.00 0.00 152987/153190 lock_kernel [3] [4] 0.0 0.00 0.00 153190 down [4] 0.00 0.00 153190/153190 __down [22577] ----------------------------------------------- 0.00 0.00 153190/153190 down [4] [22577] 0.0 0.00 0.00 153190 __down [22577] 0.00 0.00 222330308/222333093 schedule_timeout [2] As system idle is more than 50%, the schedule/schedule_timeout callers are important information. 1) lock_kernel causes most schedule/schedule_timeout calls; 2) When lock_kernel calls down, then __down, __down calls schedule_timeout many times in a loop; 3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open. -yanmin ^ permalink raw reply [flat|nested] 140+ messages in thread
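Those call counts already imply the figure that matters: dividing the schedule_timeout() count attributed to __down() by the number of __down() entries gives the average number of sleeps per semaphore acquisition. A trivial arithmetic check (userspace C, not kernel code):

```c
#include <assert.h>

/* Average schedule_timeout() calls per __down() entry, from the
 * gprof call-graph counts above. */
static long calls_per_down(long sched_timeout_calls, long down_calls)
{
    return sched_timeout_calls / down_calls;
}
```

With the numbers in the profile, 222330308 / 153190 comes out to about 1,451, i.e. each contended BKL acquisition appears to go back to sleep well over a thousand times.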
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 3:41 ` Zhang, Yanmin @ 2008-05-07 3:59 ` Andrew Morton 2008-05-07 4:46 ` Zhang, Yanmin 2008-05-07 6:26 ` Ingo Molnar 2008-05-07 11:00 ` Andi Kleen 2 siblings, 1 reply; 140+ messages in thread From: Andrew Morton @ 2008-05-07 3:59 UTC (permalink / raw) To: Zhang, Yanmin Cc: Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds On Wed, 07 May 2008 11:41:52 +0800 "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote: > As system idle is more than 50%, the schedule/schedule_timeout callers are important > information. > 1) lock_kernel causes most schedule/schedule_timeout calls; > 2) When lock_kernel calls down, then __down, __down calls schedule_timeout > many times in a loop; Really? Are you sure? That would imply that we keep on waking up tasks which then fail to acquire the lock. But the code pretty plainly doesn't do that. Odd. > 3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open. Still :( ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 3:59 ` Andrew Morton @ 2008-05-07 4:46 ` Zhang, Yanmin 0 siblings, 0 replies; 140+ messages in thread From: Zhang, Yanmin @ 2008-05-07 4:46 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds On Tue, 2008-05-06 at 20:59 -0700, Andrew Morton wrote: > On Wed, 07 May 2008 11:41:52 +0800 "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote: > > > As system idle is more than 50%, the schedule/schedule_timeout callers are important > > information. > > 1) lock_kernel causes most schedule/schedule_timeout calls; > > 2) When lock_kernel calls down, then __down, __down calls schedule_timeout > > many times in a loop; > > Really? Are you sure? That would imply that we keep on waking up tasks > which then fail to acquire the lock. But the code pretty plainly doesn't > do that. Yes, totally based on the data. The data means the calling times among functions. Initially, I just collected the callers of schedule and schedule_timeout. Then I found most schedule/schedule_timeout calls are made by __down, which is called by down. Then I changed the kernel to collect more functions' calling info. Comparing the calling times of down, __down and schedule_timeout, we find schedule_timeout is called by __down 222,330,308 times, but __down is called only 153,190 times. > > Odd. > > > 3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open. > Still :( Yes. The data has some margin of error, but the difference is small. My patch doesn't use a lock to protect the data, because that might introduce too much overhead. ^ permalink raw reply [flat|nested] 140+ messages in thread
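Yanmin's instrumentation counts how often each function reaches schedule() without taking a lock around the counters, trading a small error margin for low overhead. A userspace sketch of that idea (hypothetical names, not the actual kernel patch; the patch apparently uses plain increments and tolerates lost updates, while relaxed atomics are used here so the demo is deterministic):

```c
#include <pthread.h>
#include <stdatomic.h>

#define NR_CALLERS 4

/* One counter per instrumented call site; relaxed ordering is enough
 * because we only want approximate per-caller call counts. */
static atomic_long caller_count[NR_CALLERS];

static void count_call(int caller_id)
{
    atomic_fetch_add_explicit(&caller_count[caller_id], 1,
                              memory_order_relaxed);
}

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (int i = 0; i < 100000; i++)
        count_call(id % NR_CALLERS);
    return NULL;
}

/* Spawn a few threads hammering the counters and return the total. */
static long run_counting_demo(void)
{
    pthread_t tid[NR_CALLERS];
    long total = 0;

    for (long i = 0; i < NR_CALLERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NR_CALLERS; i++)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < NR_CALLERS; i++)
        total += atomic_load_explicit(&caller_count[i],
                                      memory_order_relaxed);
    return total;
}
```

The relevant design point is the one Yanmin makes: per-caller counters that are merely approximately consistent are far cheaper than a shared lock on every schedule() entry.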
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 3:41 ` Zhang, Yanmin 2008-05-07 3:59 ` Andrew Morton @ 2008-05-07 6:26 ` Ingo Molnar 2008-05-07 6:28 ` Ingo Molnar 2008-05-07 11:00 ` Andi Kleen 2 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-07 6:26 UTC (permalink / raw) To: Zhang, Yanmin Cc: Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > 3) Caller of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open. that's one often-forgotten BKL site: about 1000 ioctls are still running under the BKL. The TTY one is hurting the most. To make sure it's only that BKL acquire/release that hurts, could you try the hack patch below, does it make any difference to performance? but even if taking the BKL does hurt, it's quite unexpected to cause a 40% drop. Perhaps AIM7 has tons of threads that exit at once and all try to release their controlling terminal or something like that? Ingo ------------------------> Subject: DANGEROUS tty hack: no BKL From: Ingo Molnar <mingo@elte.hu> Date: Wed May 07 08:21:22 CEST 2008 NOT-Signed-off-by: Ingo Molnar <mingo@elte.hu> --- drivers/char/tty_io.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) Index: linux/drivers/char/tty_io.c =================================================================== --- linux.orig/drivers/char/tty_io.c +++ linux/drivers/char/tty_io.c @@ -2844,9 +2844,10 @@ out: static int tty_release(struct inode *inode, struct file *filp) { - lock_kernel(); + /* DANGEROUS - can crash your kernel! */ +// lock_kernel(); release_dev(filp); - unlock_kernel(); +// unlock_kernel(); return 0; } ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 6:26 ` Ingo Molnar @ 2008-05-07 6:28 ` Ingo Molnar 2008-05-07 7:05 ` Zhang, Yanmin 0 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-07 6:28 UTC (permalink / raw) To: Zhang, Yanmin Cc: Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton * Ingo Molnar <mingo@elte.hu> wrote: > > 3) Caller of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open. > > that's one often-forgotten BKL site: about 1000 ioctls are still > running under the BKL. The TTY one is hurting the most. [...] although it's an unlocked_ioctl() now in 2.6.26, so all the BKL locking has been nicely pushed down to deep inside the tty code. > [...] To make sure it's only that BKL acquire/release that hurts, > could you try the hack patch below, does it make any difference to > performance? if you use a serial console you will need the updated patch below. Ingo ----------------------> Subject: no: tty bkl From: Ingo Molnar <mingo@elte.hu> Date: Wed May 07 08:21:22 CEST 2008 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- drivers/char/tty_io.c | 5 +++-- drivers/serial/serial_core.c | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) Index: linux/drivers/char/tty_io.c =================================================================== --- linux.orig/drivers/char/tty_io.c +++ linux/drivers/char/tty_io.c @@ -2844,9 +2844,10 @@ out: static int tty_release(struct inode *inode, struct file *filp) { - lock_kernel(); + /* DANGEROUS - can crash your kernel! 
*/ +// lock_kernel(); release_dev(filp); - unlock_kernel(); +// unlock_kernel(); return 0; } Index: linux/drivers/serial/serial_core.c =================================================================== --- linux.orig/drivers/serial/serial_core.c +++ linux/drivers/serial/serial_core.c @@ -1241,7 +1241,7 @@ static void uart_close(struct tty_struct struct uart_state *state = tty->driver_data; struct uart_port *port; - BUG_ON(!kernel_locked()); +// BUG_ON(!kernel_locked()); if (!state || !state->port) return; ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 6:28 ` Ingo Molnar @ 2008-05-07 7:05 ` Zhang, Yanmin 0 siblings, 0 replies; 140+ messages in thread From: Zhang, Yanmin @ 2008-05-07 7:05 UTC (permalink / raw) To: Ingo Molnar Cc: Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton On Wed, 2008-05-07 at 08:28 +0200, Ingo Molnar wrote: > * Ingo Molnar <mingo@elte.hu> wrote: > > > > 3) Caller of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open. > > > > that's one often-forgotten BKL site: about 1000 ioctls are still > > running under the BKL. The TTY one is hurting the most. [...] > > although it's an unlocked_ioctl() now in 2.6.26, so all the BKL locking > has been nicely pushed down to deep inside the tty code. > > > [...] To make sure it's only that BKL acquire/release that hurts, > > could you try the hack patch below, does it make any difference to > > performance? > > if you use a serial console you will need the updated patch below. I tested it on my 8-core stoakley. The result is 4% worse than the one of pure 2.6.26-rc1. Still not good. > > Ingo > > ----------------------> > Subject: no: tty bkl > From: Ingo Molnar <mingo@elte.hu> > Date: Wed May 07 08:21:22 CEST 2008 > > Signed-off-by: Ingo Molnar <mingo@elte.hu> > --- > drivers/char/tty_io.c | 5 +++-- > drivers/serial/serial_core.c | 2 +- > 2 files changed, 4 insertions(+), 3 deletions(-) > > Index: linux/drivers/char/tty_io.c > =================================================================== > --- linux.orig/drivers/char/tty_io.c > +++ linux/drivers/char/tty_io.c > @@ -2844,9 +2844,10 @@ out: > > static int tty_release(struct inode *inode, struct file *filp) > { > - lock_kernel(); > + /* DANGEROUS - can crash your kernel! 
*/ > +// lock_kernel(); > release_dev(filp); > - unlock_kernel(); > +// unlock_kernel(); > return 0; > } > > Index: linux/drivers/serial/serial_core.c > =================================================================== > --- linux.orig/drivers/serial/serial_core.c > +++ linux/drivers/serial/serial_core.c > @@ -1241,7 +1241,7 @@ static void uart_close(struct tty_struct > struct uart_state *state = tty->driver_data; > struct uart_port *port; > > - BUG_ON(!kernel_locked()); > +// BUG_ON(!kernel_locked()); > > if (!state || !state->port) > return; > ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 3:41 ` Zhang, Yanmin 2008-05-07 3:59 ` Andrew Morton 2008-05-07 6:26 ` Ingo Molnar @ 2008-05-07 11:00 ` Andi Kleen 2008-05-07 11:46 ` Matthew Wilcox 2008-05-07 13:59 ` Alan Cox 2 siblings, 2 replies; 140+ messages in thread From: Andi Kleen @ 2008-05-07 11:00 UTC (permalink / raw) To: Zhang, Yanmin Cc: Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes: > 3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open. I have an older patchkit that introduced unlocked_fcntl for some cases. It was briefly in mm but then dropped. Sounds like it is worth resurrecting? tty_* is being taken care of by Alan. chrdev_open is more work. -Andi (who BTW never quite understood why BKL is a semaphore now and not a spinlock?) ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 11:00 ` Andi Kleen @ 2008-05-07 11:46 ` Matthew Wilcox 2008-05-07 12:21 ` Andi Kleen 2008-05-07 13:59 ` Alan Cox 1 sibling, 1 reply; 140+ messages in thread From: Matthew Wilcox @ 2008-05-07 11:46 UTC (permalink / raw) To: Andi Kleen Cc: Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Linus Torvalds, Andrew Morton On Wed, May 07, 2008 at 01:00:14PM +0200, Andi Kleen wrote: > "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes: > > 3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open. > > I have an older patchkit that introduced unlocked_fcntl for some cases. It was > briefly in mm but then dropped. Sounds like it is worth resurrecting? Not sure what you're talking about here, Andi. The only lock_kernel in fcntl.c is around the call to ->fasync. And Yanmin's traces don't show fasync as being a culprit, just the paths in locks.c. > tty_* is being taken care of by Alan. > > chrdev_open is more work. > > -Andi (who BTW never quite understood why BKL is a semaphore now and not > a spinlock?) See git commit 6478d8800b75253b2a934ddcb734e13ade023ad0 -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 11:46 ` Matthew Wilcox @ 2008-05-07 12:21 ` Andi Kleen 2008-05-07 14:36 ` Linus Torvalds 0 siblings, 1 reply; 140+ messages in thread From: Andi Kleen @ 2008-05-07 12:21 UTC (permalink / raw) To: Matthew Wilcox Cc: Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Linus Torvalds, Andrew Morton Matthew Wilcox <matthew@wil.cx> writes: > On Wed, May 07, 2008 at 01:00:14PM +0200, Andi Kleen wrote: >> "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes: >> > 3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open. >> >> I have an older patchkit that introduced unlocked_fcntl for some cases. It was >> briefly in mm but then dropped. Sounds like it is worth resurrecting? > > Not sure what you're talking about here, Andi. The only lock_kernel in > fcntl.c is around the call to ->fasync. And Yanmin's traces don't show > fasync as being a culprit, just the paths in locks.c I was talking about fasync. >> -Andi (who BTW never quite understood why BKL is a semaphore now and not >> a spinlock?) > > See git commit 6478d8800b75253b2a934ddcb734e13ade023ad0 I am aware of that commit, thank you, but the comment was referring to the fact that it came with about zero justification for why it was done. For the left-over BKL regions, which are relatively short, surely a spinlock would be better than a semaphore? So PREEMPT_BKL should have been removed, not !PREEMPT_BKL. If that was done all these regressions would disappear I bet. That said, of course it is still better to actually fix the lock_kernel()s, but in the shorter term just fixing lock_kernel again would be easier. -Andi ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 12:21 ` Andi Kleen @ 2008-05-07 14:36 ` Linus Torvalds 2008-05-07 14:35 ` Alan Cox ` (3 more replies) 0 siblings, 4 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 14:36 UTC (permalink / raw) To: Andi Kleen Cc: Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Andrew Morton On Wed, 7 May 2008, Andi Kleen wrote: > > I am aware of that commit, thank you, but the comment was refering to that it > came with about zero justification why it was done. For the left over BKL > regions which are relatively short surely a spinlock would be better than a > semaphore? So PREEMPT_BKL should have been removed, not !PREEMPT_BKL. I do agree. I think turning the BKL into a semaphore was fine per se, but that was when semaphores were fast. Considering the apparent AIM regressions, we really either need to revert the semaphore consolidation, or we need to fix the kernel lock. And by "fixing", I don't mean removing it - it will happen, but it almost certainly won't happen for 2.6.26. The easiest approach would seem to just turn the BKL into a mutex instead, which should hopefully be about as optimized as the old semaphores. But my preferred option would indeed be just turning it back into a spinlock - and screw latency and BKL preemption - and having the RT people who care deeply just work on removing the BKL in the long run. Is BKL preemption worth it? Sounds very dubious. Sounds even more dubious when we now apparently have even more reason to aim for removing the BKL rather than trying to mess around with it. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 14:36 ` Linus Torvalds @ 2008-05-07 14:35 ` Alan Cox 2008-05-07 15:00 ` Linus Torvalds 2008-05-07 14:57 ` Andi Kleen ` (2 subsequent siblings) 3 siblings, 1 reply; 140+ messages in thread From: Alan Cox @ 2008-05-07 14:35 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Andrew Morton > But my preferred option would indeed be just turning it back into a > spinlock - and screw latency and BKL preemption - and having the RT people > who care deeply just work on removing the BKL in the long run. It isn't as if the RT build can't use a different lock type to the default build. > Is BKL preemption worth it? Sounds very dubious. Sounds even more dubious > when we now apparently have even more reason to aim for removing the BKL > rather than trying to mess around with it. We have some horrible long lasting BKL users left unfortunately. Alan ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 14:35 ` Alan Cox @ 2008-05-07 15:00 ` Linus Torvalds 2008-05-07 15:02 ` Linus Torvalds 0 siblings, 1 reply; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 15:00 UTC (permalink / raw) To: Alan Cox Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Andrew Morton On Wed, 7 May 2008, Alan Cox wrote: > > > But my preferred option would indeed be just turning it back into a > > spinlock - and screw latency and BKL preemption - and having the RT people > > who care deeply just work on removing the BKL in the long run. > > It isn't as if the RT build can't use a different lock type to the > default build. Well, considering just *how* bad the new BKL apparently is, I think that's a separate issue. The semaphore implementation is simply not worth it. At a minimum, it should be a mutex. > > Is BKL preemption worth it? Sounds very dubious. Sounds even more dubious > > when we now apparently have even more reason to aim for removing the BKL > > rather than trying to mess around with it. > > We have some horrible long lasting BKL users left unfortunately. Quite frankly, maybe we _need_ to have a bad BKL for those to ever get fixed. As it was, people worked on trying to make the BKL behave better, and it was a failure. Rather than spend the effort on trying to make it work better (at a horrible cost), why not just say "Hell no - if you have issues with it, you need to work with people to get rid of the BKL rather than cluge around it". Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 15:00 ` Linus Torvalds @ 2008-05-07 15:02 ` Linus Torvalds 0 siblings, 0 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 15:02 UTC (permalink / raw) To: Alan Cox Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Andrew Morton On Wed, 7 May 2008, Linus Torvalds wrote: > > Quite frankly, maybe we _need_ to have a bad BKL for those to ever get > fixed. As it was, people worked on trying to make the BKL behave better, > and it was a failure. Rather than spend the effort on trying to make it > work better (at a horrible cost), why not just say "Hell no - if you have > issues with it, you need to work with people to get rid of the BKL > rather than cluge around it". Put another way: if we had introduced the BKL-as-semaphore with a known 40% performance drop in AIM7, I would simply never ever have accepted the patch in the first place, regardless of _any_ excuses. Performance is a feature too. Now, just because the code is already merged should not be an excuse for it then being shown to be bad. It's not a valid excuse to say "but we already merged it, so we can't unmerge it". We sure as hell _can_ unmerge it. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 14:36 ` Linus Torvalds 2008-05-07 14:35 ` Alan Cox @ 2008-05-07 14:57 ` Andi Kleen 2008-05-07 15:31 ` Andrew Morton 2008-05-07 15:19 ` Linus Torvalds 2008-05-07 16:20 ` Ingo Molnar 3 siblings, 1 reply; 140+ messages in thread From: Andi Kleen @ 2008-05-07 14:57 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Andrew Morton > I think turning the BKL into a semaphore was fine per se, Are there really that many long lived BKL holders? I have some doubts. A semaphore only makes sense when the critical region is at least many thousands of cycles, to amortize the scheduling overhead. Ok perhaps this needs some numbers to decide. > but that was > when semaphores were fast. The semaphores should still be nearly as fast in theory, especially for the contended case. > Considering the apparent AIM regressions, we really either need to revert > the semaphore consolidation, Or figure out what made the semaphore consolidation slower? As Ingo pointed out earlier, 40% is unlikely to be a fast path problem, but some algorithmic problem. Surely that is fixable (even for .26)? Perhaps we were lucky it showed so easily, not in something tricky. -Andi ^ permalink raw reply [flat|nested] 140+ messages in thread
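Andi's amortization argument can be made concrete with a toy cost model (the numbers are illustrative assumptions, not measurements): a blocked waiter pays roughly two context switches (sleeping and being woken), while a spinning waiter burns the holder's remaining hold time, so blocking only wins when the expected hold time exceeds that scheduling overhead.

```c
#include <stdbool.h>

/* Toy model: a blocked waiter pays ~two context switches (sleep and
 * wake); a spinning waiter burns the lock holder's remaining hold
 * time.  Blocking is only worth it when the hold time exceeds that
 * scheduling overhead.  All cycle counts are illustrative. */
static bool blocking_pays_off(long hold_cycles, long ctx_switch_cycles)
{
    return hold_cycles > 2 * ctx_switch_cycles;
}
```

With, say, a context switch costing on the order of a thousand cycles, a BKL region held for a few hundred cycles should spin rather than sleep; only multi-thousand-cycle holders would justify a semaphore, which is exactly the question Andi is raising.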
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 14:57 ` Andi Kleen @ 2008-05-07 15:31 ` Andrew Morton 2008-05-07 16:22 ` Matthew Wilcox 0 siblings, 1 reply; 140+ messages in thread From: Andrew Morton @ 2008-05-07 15:31 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro On Wed, 07 May 2008 16:57:52 +0200 Andi Kleen <andi@firstfloor.org> wrote: > Or figure out what made the semaphore consolidation slower? As Ingo > pointed out earlier 40% is unlikely to be a fast path problem, but some > algorithmic problem. Surely that is fixable (even for .26)? Absolutely. Yanmin is apparently showing that each call to __down() results in 1,451 calls to schedule(). wtf? ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 15:31 ` Andrew Morton @ 2008-05-07 16:22 ` Matthew Wilcox 0 siblings, 0 replies; 140+ messages in thread From: Matthew Wilcox @ 2008-05-07 16:22 UTC (permalink / raw) To: Andrew Morton Cc: Andi Kleen, Linus Torvalds, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro On Wed, May 07, 2008 at 08:31:05AM -0700, Andrew Morton wrote: > On Wed, 07 May 2008 16:57:52 +0200 Andi Kleen <andi@firstfloor.org> wrote: > > > Or figure out what made the semaphore consolidation slower? As Ingo > > pointed out earlier 40% is unlikely to be a fast path problem, but some > > algorithmic problem. Surely that is fixable (even for .26)? > > Absolutely. Yanmin is apparently showing that each call to __down() > results in 1,451 calls to schedule(). wtf? I can't figure it out either. Unless schedule() is broken somehow ... but that should have shown up with semaphore-sleepers.c, shouldn't it? One other difference between semaphore-sleepers and the new generic code is that in effect, semaphore-sleepers does a little bit of spinning before it sleeps. That is, if up() and down() are called more-or-less simultaneously, the increment of sem->count will happen before __down calls schedule(). 
How about something like this: diff --git a/kernel/semaphore.c b/kernel/semaphore.c index 5c2942e..ef83f5a 100644 --- a/kernel/semaphore.c +++ b/kernel/semaphore.c @@ -211,6 +211,7 @@ static inline int __sched __down_common(struct semaphore *sem, long state, waiter.up = 0; for (;;) { + int i; if (state == TASK_INTERRUPTIBLE && signal_pending(task)) goto interrupted; if (state == TASK_KILLABLE && fatal_signal_pending(task)) @@ -219,7 +220,15 @@ static inline int __sched __down_common(struct semaphore *sem, long state, goto timed_out; __set_task_state(task, state); spin_unlock_irq(&sem->lock); + + for (i = 0; i < 10; i++) { + if (waiter.up) + goto skip_schedule; + cpu_relax(); + } + timeout = schedule_timeout(timeout); + skip_schedule: spin_lock_irq(&sem->lock); if (waiter.up) return 0; Maybe it'd be enough to test it once ... or maybe we should use spin_is_locked() ... Ingo? -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply related [flat|nested] 140+ messages in thread
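Matthew's patch above amounts to the classic spin-then-block pattern: poll the wakeup flag for a few iterations before paying for a full schedule() round trip. A userspace analogue of that shape (a hedged sketch with made-up names, built on pthreads rather than the kernel's semaphore code):

```c
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

struct event {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool up;          /* analogous to waiter.up in the kernel code */
};

/* Spin a few times hoping the waker sets ->up before we pay for a
 * full sleep/wakeup cycle; otherwise block on the condition variable. */
static void event_wait(struct event *ev)
{
    for (int i = 0; i < 10; i++) {
        if (__atomic_load_n(&ev->up, __ATOMIC_ACQUIRE))
            return;
        sched_yield();  /* stand-in for cpu_relax() */
    }
    pthread_mutex_lock(&ev->lock);
    while (!ev->up)
        pthread_cond_wait(&ev->cond, &ev->lock);
    pthread_mutex_unlock(&ev->lock);
}

static void event_post(struct event *ev)
{
    pthread_mutex_lock(&ev->lock);
    __atomic_store_n(&ev->up, true, __ATOMIC_RELEASE);
    pthread_cond_signal(&ev->cond);
    pthread_mutex_unlock(&ev->lock);
}

static void *poster(void *arg)
{
    event_post(arg);
    return NULL;
}

/* Returns 1 once the waiter has observed the post. */
static int spin_then_block_demo(void)
{
    struct event ev = { PTHREAD_MUTEX_INITIALIZER,
                        PTHREAD_COND_INITIALIZER, false };
    pthread_t t;

    pthread_create(&t, NULL, poster, &ev);
    event_wait(&ev);
    pthread_join(t, NULL);
    return ev.up ? 1 : 0;
}
```

The trade-off is the same one the patch makes: a short bounded poll is wasted work if the post is slow to arrive, but it avoids a sleep/wake pair whenever up() lands more or less simultaneously with down().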
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 14:36 ` Linus Torvalds 2008-05-07 14:35 ` Alan Cox 2008-05-07 14:57 ` Andi Kleen @ 2008-05-07 15:19 ` Linus Torvalds 2008-05-07 17:14 ` Ingo Molnar 2008-05-08 2:44 ` Zhang, Yanmin 2008-05-07 16:20 ` Ingo Molnar 3 siblings, 2 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-07 15:19 UTC (permalink / raw) To: Andi Kleen Cc: Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Andrew Morton On Wed, 7 May 2008, Linus Torvalds wrote: > > But my preferred option would indeed be just turning it back into a > spinlock - and screw latency and BKL preemption - and having the RT people > who care deeply just work on removing the BKL in the long run. Here's a trial balloon patch to do that. Yanmin - this is not well tested, but the code is fairly obvious, and it would be interesting to hear if this fixes the performance regression. Because if it doesn't, then it's not the BKL, or something totally different is going on. Now, we should probably also test just converting the thing to a mutex, to see if that perhaps also fixes it. Linus --- arch/mn10300/Kconfig | 11 ---- include/linux/hardirq.h | 18 ++++--- kernel/sched.c | 27 ++--------- lib/kernel_lock.c | 120 +++++++++++++++++++++++++++++++--------------- 4 files changed, 95 insertions(+), 81 deletions(-) diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig index 6a6409a..e856218 100644 --- a/arch/mn10300/Kconfig +++ b/arch/mn10300/Kconfig @@ -186,17 +186,6 @@ config PREEMPT Say Y here if you are building a kernel for a desktop, embedded or real-time system. Say N if you are unsure. -config PREEMPT_BKL - bool "Preempt The Big Kernel Lock" - depends on PREEMPT - default y - help - This option reduces the latency of the kernel by making the - big kernel lock preemptible. - - Say Y here if you are building a kernel for a desktop system. - Say N if you are unsure. 
-
 config MN10300_CURRENT_IN_E2
	bool "Hold current task address in E2 register"
	default y
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 897f723..181006c 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -72,6 +72,14 @@
 #define in_softirq()		(softirq_count())
 #define in_interrupt()		(irq_count())
 
+#if defined(CONFIG_PREEMPT)
+# define PREEMPT_INATOMIC_BASE kernel_locked()
+# define PREEMPT_CHECK_OFFSET 1
+#else
+# define PREEMPT_INATOMIC_BASE 0
+# define PREEMPT_CHECK_OFFSET 0
+#endif
+
 /*
  * Are we running in atomic context?  WARNING: this macro cannot
  * always detect atomic context; in particular, it cannot know about
@@ -79,17 +87,11 @@
  * used in the general case to determine whether sleeping is possible.
  * Do not use in_atomic() in driver code.
  */
-#define in_atomic()	((preempt_count() & ~PREEMPT_ACTIVE) != 0)
-
-#ifdef CONFIG_PREEMPT
-# define PREEMPT_CHECK_OFFSET 1
-#else
-# define PREEMPT_CHECK_OFFSET 0
-#endif
+#define in_atomic()	((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_INATOMIC_BASE)
 
 /*
  * Check whether we were atomic before we did preempt_disable():
- * (used by the scheduler)
+ * (used by the scheduler, *after* releasing the kernel lock)
  */
 #define in_atomic_preempt_off() \
		((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)
diff --git a/kernel/sched.c b/kernel/sched.c
index 58fb8af..c51b656 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4567,8 +4567,6 @@ EXPORT_SYMBOL(schedule);
 asmlinkage void __sched preempt_schedule(void)
 {
	struct thread_info *ti = current_thread_info();
-	struct task_struct *task = current;
-	int saved_lock_depth;
 
	/*
	 * If there is a non-zero preempt_count or interrupts are disabled,
@@ -4579,16 +4577,7 @@ asmlinkage void __sched preempt_schedule(void)
 
	do {
		add_preempt_count(PREEMPT_ACTIVE);
-
-		/*
-		 * We keep the big kernel semaphore locked, but we
-		 * clear ->lock_depth so that schedule() doesnt
-		 * auto-release the semaphore:
-		 */
-		saved_lock_depth = task->lock_depth;
-		task->lock_depth = -1;
		schedule();
-		task->lock_depth = saved_lock_depth;
		sub_preempt_count(PREEMPT_ACTIVE);
 
		/*
@@ -4609,26 +4598,15 @@ EXPORT_SYMBOL(preempt_schedule);
 asmlinkage void __sched preempt_schedule_irq(void)
 {
	struct thread_info *ti = current_thread_info();
-	struct task_struct *task = current;
-	int saved_lock_depth;
 
	/* Catch callers which need to be fixed */
	BUG_ON(ti->preempt_count || !irqs_disabled());
 
	do {
		add_preempt_count(PREEMPT_ACTIVE);
-
-		/*
-		 * We keep the big kernel semaphore locked, but we
-		 * clear ->lock_depth so that schedule() doesnt
-		 * auto-release the semaphore:
-		 */
-		saved_lock_depth = task->lock_depth;
-		task->lock_depth = -1;
		local_irq_enable();
		schedule();
		local_irq_disable();
-		task->lock_depth = saved_lock_depth;
		sub_preempt_count(PREEMPT_ACTIVE);
 
		/*
@@ -5853,8 +5831,11 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
	spin_unlock_irqrestore(&rq->lock, flags);
 
	/* Set the preempt count _outside_ the spinlocks! */
+#if defined(CONFIG_PREEMPT)
+	task_thread_info(idle)->preempt_count = (idle->lock_depth >= 0);
+#else
	task_thread_info(idle)->preempt_count = 0;
-
+#endif
	/*
	 * The idle tasks have their own, simple scheduling class:
	 */
diff --git a/lib/kernel_lock.c b/lib/kernel_lock.c
index cd3e825..06722aa 100644
--- a/lib/kernel_lock.c
+++ b/lib/kernel_lock.c
@@ -11,79 +11,121 @@
 #include <linux/semaphore.h>
 
 /*
- * The 'big kernel semaphore'
+ * The 'big kernel lock'
  *
- * This mutex is taken and released recursively by lock_kernel()
+ * This spinlock is taken and released recursively by lock_kernel()
  * and unlock_kernel().  It is transparently dropped and reacquired
  * over schedule().  It is used to protect legacy code that hasn't
  * been migrated to a proper locking design yet.
  *
- * Note: code locked by this semaphore will only be serialized against
- * other code using the same locking facility. The code guarantees that
- * the task remains on the same CPU.
- *
  * Don't use in new code.
  */
-static DECLARE_MUTEX(kernel_sem);
+static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(kernel_flag);
+
 
 /*
- * Re-acquire the kernel semaphore.
+ * Acquire/release the underlying lock from the scheduler.
  *
- * This function is called with preemption off.
+ * This is called with preemption disabled, and should
+ * return an error value if it cannot get the lock and
+ * TIF_NEED_RESCHED gets set.
  *
- * We are executing in schedule() so the code must be extremely careful
- * about recursion, both due to the down() and due to the enabling of
- * preemption. schedule() will re-check the preemption flag after
- * reacquiring the semaphore.
+ * If it successfully gets the lock, it should increment
+ * the preemption count like any spinlock does.
+ *
+ * (This works on UP too - _raw_spin_trylock will never
+ * return false in that case)
  */
 int __lockfunc __reacquire_kernel_lock(void)
 {
-	struct task_struct *task = current;
-	int saved_lock_depth = task->lock_depth;
-
-	BUG_ON(saved_lock_depth < 0);
-
-	task->lock_depth = -1;
-	preempt_enable_no_resched();
-
-	down(&kernel_sem);
-
+	while (!_raw_spin_trylock(&kernel_flag)) {
+		if (test_thread_flag(TIF_NEED_RESCHED))
+			return -EAGAIN;
+		cpu_relax();
+	}
	preempt_disable();
-	task->lock_depth = saved_lock_depth;
-
	return 0;
 }
 
 void __lockfunc __release_kernel_lock(void)
 {
-	up(&kernel_sem);
+	_raw_spin_unlock(&kernel_flag);
+	preempt_enable_no_resched();
 }
 
 /*
- * Getting the big kernel semaphore.
+ * These are the BKL spinlocks - we try to be polite about preemption.
+ * If SMP is not on (ie UP preemption), this all goes away because the
+ * _raw_spin_trylock() will always succeed.
  */
-void __lockfunc lock_kernel(void)
+#ifdef CONFIG_PREEMPT
+static inline void __lock_kernel(void)
 {
-	struct task_struct *task = current;
-	int depth = task->lock_depth + 1;
+	preempt_disable();
+	if (unlikely(!_raw_spin_trylock(&kernel_flag))) {
+		/*
+		 * If preemption was disabled even before this
+		 * was called, there's nothing we can be polite
+		 * about - just spin.
+		 */
+		if (preempt_count() > 1) {
+			_raw_spin_lock(&kernel_flag);
+			return;
+		}
 
-	if (likely(!depth))
		/*
-		 * No recursion worries - we set up lock_depth _after_
+		 * Otherwise, let's wait for the kernel lock
+		 * with preemption enabled..
		 */
-		down(&kernel_sem);
+		do {
+			preempt_enable();
+			while (spin_is_locked(&kernel_flag))
+				cpu_relax();
+			preempt_disable();
+		} while (!_raw_spin_trylock(&kernel_flag));
+	}
+}
 
-	task->lock_depth = depth;
+#else
+
+/*
+ * Non-preemption case - just get the spinlock
+ */
+static inline void __lock_kernel(void)
+{
+	_raw_spin_lock(&kernel_flag);
 }
+#endif
 
-void __lockfunc unlock_kernel(void)
+static inline void __unlock_kernel(void)
 {
-	struct task_struct *task = current;
+	/*
+	 * the BKL is not covered by lockdep, so we open-code the
+	 * unlocking sequence (and thus avoid the dep-chain ops):
+	 */
+	_raw_spin_unlock(&kernel_flag);
+	preempt_enable();
+}
 
-	BUG_ON(task->lock_depth < 0);
+/*
+ * Getting the big kernel lock.
+ *
+ * This cannot happen asynchronously, so we only need to
+ * worry about other CPU's.
+ */
+void __lockfunc lock_kernel(void)
+{
+	int depth = current->lock_depth+1;
+	if (likely(!depth))
+		__lock_kernel();
+	current->lock_depth = depth;
+}
 
-	if (likely(--task->lock_depth < 0))
-		up(&kernel_sem);
+void __lockfunc unlock_kernel(void)
+{
+	BUG_ON(current->lock_depth < 0);
+	if (likely(--current->lock_depth < 0))
+		__unlock_kernel();
 }
 
 EXPORT_SYMBOL(lock_kernel);

^ permalink raw reply related	[flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 15:19             ` Linus Torvalds
@ 2008-05-07 17:14               ` Ingo Molnar
  2008-05-08  2:44               ` Zhang, Yanmin
  1 sibling, 0 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-07 17:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro,
	Andrew Morton

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > But my preferred option would indeed be just turning it back into a
> > spinlock - and screw latency and BKL preemption - and having the RT
> > people who care deeply just work on removing the BKL in the long
> > run.
>
> Here's a trial balloon patch to do that.

here's a simpler trial balloon test-patch (well, hack) that is also
reasonably well tested. It turns the BKL into a "spin-semaphore". If
this resolves the performance problem then it's all due to the BKL's
scheduling/preemption properties.

this approach is ugly (it's just a more expensive spinlock), but it has
an advantage: the code logic is obviously correct, and it would also
make it much easier later on to turn the BKL back into a sleeping lock
again - once the TTY code's BKL use is fixed. (i think Alan said it
might happen in the next few months)

The BKL is more expensive than a simple spinlock anyway.
	Ingo

------------->
Subject: BKL: spin on acquire
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 19:05:40 CEST 2008

NOT-Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 lib/kernel_lock.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux/lib/kernel_lock.c
===================================================================
--- linux.orig/lib/kernel_lock.c
+++ linux/lib/kernel_lock.c
@@ -46,7 +46,8 @@ int __lockfunc __reacquire_kernel_lock(v
	task->lock_depth = -1;
	preempt_enable_no_resched();
 
-	down(&kernel_sem);
+	while (down_trylock(&kernel_sem))
+		cpu_relax();
 
	preempt_disable();
	task->lock_depth = saved_lock_depth;
@@ -67,11 +68,13 @@ void __lockfunc lock_kernel(void)
	struct task_struct *task = current;
	int depth = task->lock_depth + 1;
 
-	if (likely(!depth))
+	if (likely(!depth)) {
		/*
		 * No recursion worries - we set up lock_depth _after_
		 */
-		down(&kernel_sem);
+		while (down_trylock(&kernel_sem))
+			cpu_relax();
+	}
 
	task->lock_depth = depth;
 }

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-07 15:19             ` Linus Torvalds
  2008-05-07 17:14               ` Ingo Molnar
@ 2008-05-08  2:44               ` Zhang, Yanmin
  2008-05-08  3:29                 ` Linus Torvalds
  2008-05-08  6:43                 ` AIM7 40% regression with 2.6.26-rc1 Ingo Molnar
  1 sibling, 2 replies; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-08  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Matthew Wilcox, Ingo Molnar, LKML, Alexander Viro,
	Andrew Morton

On Wed, 2008-05-07 at 08:19 -0700, Linus Torvalds wrote:
>
> On Wed, 7 May 2008, Linus Torvalds wrote:
> >
> > But my preferred option would indeed be just turning it back into a
> > spinlock - and screw latency and BKL preemption - and having the RT people
> > who care deeply just work on removing the BKL in the long run.
>
> Here's a trial balloon patch to do that.
>
> Yanmin - this is not well tested, but the code is fairly obvious, and it
> would be interesting to hear if this fixes the performance regression.
> Because if it doesn't, then it's not the BKL, or something totally
> different is going on.
Congratulations! The patch really fixes the regression completely!
vmstat showed cpu idle is 0%, just like 2.6.25's.

Some config options in my .config file:
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y

yanmin

>
> Now, we should probably also test just converting the thing to a mutex,
> to see if that perhaps also fixes it.
>
> 		Linus
>
> ---
>  arch/mn10300/Kconfig    |   11 ----
>  include/linux/hardirq.h |   18 ++++---
>  kernel/sched.c          |   27 ++---------
>  lib/kernel_lock.c       |  120 +++++++++++++++++++++++++++++++---------------
>  4 files changed, 95 insertions(+), 81 deletions(-)
>
> diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig
> index 6a6409a..e856218 100644
> --- a/arch/mn10300/Kconfig
> +++ b/arch/mn10300/Kconfig
> @@ -186,17 +186,6 @@ config PREEMPT
>  	  Say Y here if you are building a kernel for a desktop, embedded
>  	  or real-time system.  Say N if you are unsure.
>
> -config PREEMPT_BKL
> -	bool "Preempt The Big Kernel Lock"
> -	depends on PREEMPT
> -	default y
> -	help
> -	  This option reduces the latency of the kernel by making the
> -	  big kernel lock preemptible.
> -
> -	  Say Y here if you are building a kernel for a desktop system.
> -	  Say N if you are unsure.
> -
>  config MN10300_CURRENT_IN_E2
>  	bool "Hold current task address in E2 register"
>  	default y
> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> index 897f723..181006c 100644
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -72,6 +72,14 @@
>  #define in_softirq()		(softirq_count())
>  #define in_interrupt()		(irq_count())
>
> +#if defined(CONFIG_PREEMPT)
> +# define PREEMPT_INATOMIC_BASE kernel_locked()
> +# define PREEMPT_CHECK_OFFSET 1
> +#else
> +# define PREEMPT_INATOMIC_BASE 0
> +# define PREEMPT_CHECK_OFFSET 0
> +#endif
> +
>  /*
>   * Are we running in atomic context?  WARNING: this macro cannot
>   * always detect atomic context; in particular, it cannot know about
> @@ -79,17 +87,11 @@
>   * used in the general case to determine whether sleeping is possible.
>   * Do not use in_atomic() in driver code.
>   */
> -#define in_atomic()	((preempt_count() & ~PREEMPT_ACTIVE) != 0)
> -
> -#ifdef CONFIG_PREEMPT
> -# define PREEMPT_CHECK_OFFSET 1
> -#else
> -# define PREEMPT_CHECK_OFFSET 0
> -#endif
> +#define in_atomic()	((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_INATOMIC_BASE)
>
>  /*
>   * Check whether we were atomic before we did preempt_disable():
> - * (used by the scheduler)
> + * (used by the scheduler, *after* releasing the kernel lock)
>   */
>  #define in_atomic_preempt_off() \
> 		((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 58fb8af..c51b656 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4567,8 +4567,6 @@ EXPORT_SYMBOL(schedule);
>  asmlinkage void __sched preempt_schedule(void)
>  {
> 	struct thread_info *ti = current_thread_info();
> -	struct task_struct *task = current;
> -	int saved_lock_depth;
>
> 	/*
> 	 * If there is a non-zero preempt_count or interrupts are disabled,
> @@ -4579,16 +4577,7 @@ asmlinkage void __sched preempt_schedule(void)
>
> 	do {
> 		add_preempt_count(PREEMPT_ACTIVE);
> -
> -		/*
> -		 * We keep the big kernel semaphore locked, but we
> -		 * clear ->lock_depth so that schedule() doesnt
> -		 * auto-release the semaphore:
> -		 */
> -		saved_lock_depth = task->lock_depth;
> -		task->lock_depth = -1;
> 		schedule();
> -		task->lock_depth = saved_lock_depth;
> 		sub_preempt_count(PREEMPT_ACTIVE);
>
> 		/*
> @@ -4609,26 +4598,15 @@ EXPORT_SYMBOL(preempt_schedule);
>  asmlinkage void __sched preempt_schedule_irq(void)
>  {
> 	struct thread_info *ti = current_thread_info();
> -	struct task_struct *task = current;
> -	int saved_lock_depth;
>
> 	/* Catch callers which need to be fixed */
> 	BUG_ON(ti->preempt_count || !irqs_disabled());
>
> 	do {
> 		add_preempt_count(PREEMPT_ACTIVE);
> -
> -		/*
> -		 * We keep the big kernel semaphore locked, but we
> -		 * clear ->lock_depth so that schedule() doesnt
> -		 * auto-release the semaphore:
> -		 */
> -		saved_lock_depth = task->lock_depth;
> -		task->lock_depth = -1;
> 		local_irq_enable();
> 		schedule();
> 		local_irq_disable();
> -		task->lock_depth = saved_lock_depth;
> 		sub_preempt_count(PREEMPT_ACTIVE);
>
> 		/*
> @@ -5853,8 +5831,11 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
> 	spin_unlock_irqrestore(&rq->lock, flags);
>
> 	/* Set the preempt count _outside_ the spinlocks! */
> +#if defined(CONFIG_PREEMPT)
> +	task_thread_info(idle)->preempt_count = (idle->lock_depth >= 0);
> +#else
> 	task_thread_info(idle)->preempt_count = 0;
> -
> +#endif
> 	/*
> 	 * The idle tasks have their own, simple scheduling class:
> 	 */
> diff --git a/lib/kernel_lock.c b/lib/kernel_lock.c
> index cd3e825..06722aa 100644
> --- a/lib/kernel_lock.c
> +++ b/lib/kernel_lock.c
> @@ -11,79 +11,121 @@
>  #include <linux/semaphore.h>
>
>  /*
> - * The 'big kernel semaphore'
> + * The 'big kernel lock'
>   *
> - * This mutex is taken and released recursively by lock_kernel()
> + * This spinlock is taken and released recursively by lock_kernel()
>   * and unlock_kernel().  It is transparently dropped and reacquired
>   * over schedule().  It is used to protect legacy code that hasn't
>   * been migrated to a proper locking design yet.
>   *
> - * Note: code locked by this semaphore will only be serialized against
> - * other code using the same locking facility. The code guarantees that
> - * the task remains on the same CPU.
> - *
>   * Don't use in new code.
>   */
> -static DECLARE_MUTEX(kernel_sem);
> +static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(kernel_flag);
> +
>
>  /*
> - * Re-acquire the kernel semaphore.
> + * Acquire/release the underlying lock from the scheduler.
>   *
> - * This function is called with preemption off.
> + * This is called with preemption disabled, and should
> + * return an error value if it cannot get the lock and
> + * TIF_NEED_RESCHED gets set.
>   *
> - * We are executing in schedule() so the code must be extremely careful
> - * about recursion, both due to the down() and due to the enabling of
> - * preemption. schedule() will re-check the preemption flag after
> - * reacquiring the semaphore.
> + * If it successfully gets the lock, it should increment
> + * the preemption count like any spinlock does.
> + *
> + * (This works on UP too - _raw_spin_trylock will never
> + * return false in that case)
>   */
>  int __lockfunc __reacquire_kernel_lock(void)
>  {
> -	struct task_struct *task = current;
> -	int saved_lock_depth = task->lock_depth;
> -
> -	BUG_ON(saved_lock_depth < 0);
> -
> -	task->lock_depth = -1;
> -	preempt_enable_no_resched();
> -
> -	down(&kernel_sem);
> -
> +	while (!_raw_spin_trylock(&kernel_flag)) {
> +		if (test_thread_flag(TIF_NEED_RESCHED))
> +			return -EAGAIN;
> +		cpu_relax();
> +	}
> 	preempt_disable();
> -	task->lock_depth = saved_lock_depth;
> -
> 	return 0;
>  }
>
>  void __lockfunc __release_kernel_lock(void)
>  {
> -	up(&kernel_sem);
> +	_raw_spin_unlock(&kernel_flag);
> +	preempt_enable_no_resched();
>  }
>
>  /*
> - * Getting the big kernel semaphore.
> + * These are the BKL spinlocks - we try to be polite about preemption.
> + * If SMP is not on (ie UP preemption), this all goes away because the
> + * _raw_spin_trylock() will always succeed.
>   */
> -void __lockfunc lock_kernel(void)
> +#ifdef CONFIG_PREEMPT
> +static inline void __lock_kernel(void)
>  {
> -	struct task_struct *task = current;
> -	int depth = task->lock_depth + 1;
> +	preempt_disable();
> +	if (unlikely(!_raw_spin_trylock(&kernel_flag))) {
> +		/*
> +		 * If preemption was disabled even before this
> +		 * was called, there's nothing we can be polite
> +		 * about - just spin.
> +		 */
> +		if (preempt_count() > 1) {
> +			_raw_spin_lock(&kernel_flag);
> +			return;
> +		}
>
> -	if (likely(!depth))
> 		/*
> -		 * No recursion worries - we set up lock_depth _after_
> +		 * Otherwise, let's wait for the kernel lock
> +		 * with preemption enabled..
> 		 */
> -		down(&kernel_sem);
> +		do {
> +			preempt_enable();
> +			while (spin_is_locked(&kernel_flag))
> +				cpu_relax();
> +			preempt_disable();
> +		} while (!_raw_spin_trylock(&kernel_flag));
> +	}
> +}
>
> -	task->lock_depth = depth;
> +#else
> +
> +/*
> + * Non-preemption case - just get the spinlock
> + */
> +static inline void __lock_kernel(void)
> +{
> +	_raw_spin_lock(&kernel_flag);
>  }
> +#endif
>
> -void __lockfunc unlock_kernel(void)
> +static inline void __unlock_kernel(void)
>  {
> -	struct task_struct *task = current;
> +	/*
> +	 * the BKL is not covered by lockdep, so we open-code the
> +	 * unlocking sequence (and thus avoid the dep-chain ops):
> +	 */
> +	_raw_spin_unlock(&kernel_flag);
> +	preempt_enable();
> +}
>
> -	BUG_ON(task->lock_depth < 0);
> +/*
> + * Getting the big kernel lock.
> + *
> + * This cannot happen asynchronously, so we only need to
> + * worry about other CPU's.
> + */
> +void __lockfunc lock_kernel(void)
> +{
> +	int depth = current->lock_depth+1;
> +	if (likely(!depth))
> +		__lock_kernel();
> +	current->lock_depth = depth;
> +}
>
> -	if (likely(--task->lock_depth < 0))
> -		up(&kernel_sem);
> +void __lockfunc unlock_kernel(void)
> +{
> +	BUG_ON(current->lock_depth < 0);
> +	if (likely(--current->lock_depth < 0))
> +		__unlock_kernel();
>  }
>
>  EXPORT_SYMBOL(lock_kernel);

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  2:44               ` Zhang, Yanmin
@ 2008-05-08  3:29                 ` Linus Torvalds
  2008-05-08  4:08                   ` Zhang, Yanmin
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08  3:29 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andi Kleen, Matthew Wilcox, Ingo Molnar, LKML, Alexander Viro,
	Andrew Morton

On Thu, 8 May 2008, Zhang, Yanmin wrote:
>
> Congratulations! The patch really fixes the regression completely!
> vmstat showed cpu idle is 0%, just like 2.6.25's.

Well, that shows that it was the BKL.

That said, "idle 0%" is easy when you spin. Do you also have actual
performance numbers? I'd hope that not only do we use full CPU time, it's
also at least as fast as the old semaphores were?

While I've been dissing sleeping locks (because their overhead is so
high), at least in _theory_ they can get better behavior when not
spinning. Now, that's not going to happen with the BKL, I'm 99.99% sure,
but I'd still like to hear actual performance numbers too, just to be
sure.

Anyway, at least the "was it the BKL or some other semaphore user"
question is entirely off the table.

So we need to

 - fix the BKL. My patch may be a good starting point, but there are
   alternatives:

   (a) reinstate the old BKL code entirely

       Quite frankly, I'd prefer not to. Not only did it have three
       totally different cases, some of them were apparently broken (ie
       BKL+regular preempt didn't cond_resched() right), and I just don't
       think it's worth maintaining three different versions, when
       distro's are going to pick one anyway. *We* should pick one, and
       maintain it.

   (b) screw the special BKL preemption - it's a spinlock, we don't
       preempt other spinlocks, but at least fix BKL+preempt+cond_resched
       thing.

       This would be "my patch + fixes" where at least one of the fixes
       is the known (apparently old) cond_preempt() bug.

   (c) Try to keep the 2.6.25 code as closely as possible, but just
       switch over to mutexes instead.

       I dunno. I was never all that enamoured with the BKL as a sleeping
       lock, so I'm biased against this one, but hey, it's just a
       personal bias.

 - get rid of the BKL anyway, at least in anything that is noticeable.

   Matthew's patch to file locking is probably worth doing as-is, simply
   because I haven't heard any better solutions. The BKL certainly can't
   be it, and whatever comes out of the NFSD discussion will almost
   certainly involve just making sure that those leases just use the new
   fs/locks.c lock.

   This is also why I'd actually prefer the simplest possible
   (non-preempting) spinlock BKL. Because it means that we can get rid of
   all that "saved_lock_depth" crud (like my patch already did). We
   shouldn't aim for a clever BKL, we should aim for a BKL that nobody
   uses.

I'm certainly open to anything. Regardless, we should decide fairly soon,
so that we have the choice made before -rc2 is out, and not drag this out,
since regardless of the choice it needs to be tested and people comfy with
it for the 2.6.26 release.

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  3:29                 ` Linus Torvalds
@ 2008-05-08  4:08                   ` Zhang, Yanmin
  2008-05-08  4:17                     ` Linus Torvalds
  0 siblings, 1 reply; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-08  4:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Matthew Wilcox, Ingo Molnar, LKML, Alexander Viro,
	Andrew Morton

On Wed, 2008-05-07 at 20:29 -0700, Linus Torvalds wrote:
>
> On Thu, 8 May 2008, Zhang, Yanmin wrote:
> >
> > Congratulations! The patch really fixes the regression completely!
> > vmstat showed cpu idle is 0%, just like 2.6.25's.
>
> Well, that shows that it was the BKL.
>
> That said, "idle 0%" is easy when you spin. Do you also have actual
> performance numbers?
Yes. My conclusion is based on the actual number. cpu idle 0% is just
a behavior it should be.

> I'd hope that not only do we use full CPU time, it's
> also at least as fast as the old semaphores were?
Yes.

>
> While I've been dissing sleeping locks (because their overhead is so
> high), at least in _theory_ they can get better behavior when not
> spinning. Now, that's not going to happen with the BKL, I'm 99.99% sure,
> but I'd still like to hear actual performance numbers too, just to be
> sure.
For sure.

>
> Anyway, at least the "was it the BKL or some other semaphore user"
> question is entirely off the table.
>
> So we need to
>
>  - fix the BKL. My patch may be a good starting point, but there are
>    alternatives:
>
>    (a) reinstate the old BKL code entirely
>
>        Quite frankly, I'd prefer not to. Not only did it have three
>        totally different cases, some of them were apparently broken (ie
>        BKL+regular preempt didn't cond_resched() right), and I just don't
>        think it's worth maintaining three different versions, when
>        distro's are going to pick one anyway. *We* should pick one, and
>        maintain it.
>
>    (b) screw the special BKL preemption - it's a spinlock, we don't
>        preempt other spinlocks, but at least fix BKL+preempt+cond_resched
>        thing.
>
>        This would be "my patch + fixes" where at least one of the fixes
>        is the known (apparently old) cond_preempt() bug.
>
>    (c) Try to keep the 2.6.25 code as closely as possible, but just
>        switch over to mutexes instead.
>
>        I dunno. I was never all that enamoured with the BKL as a sleeping
>        lock, so I'm biased against this one, but hey, it's just a
>        personal bias.
>
>  - get rid of the BKL anyway, at least in anything that is noticeable.
>
>    Matthew's patch to file locking is probably worth doing as-is,
>    simply because I haven't heard any better solutions. The BKL
>    certainly can't be it, and whatever comes out of the NFSD
>    discussion will almost certainly involve just making sure that
>    those leases just use the new fs/locks.c lock.
>
>    This is also why I'd actually prefer the simplest possible
>    (non-preempting) spinlock BKL. Because it means that we can get
>    rid of all that "saved_lock_depth" crud (like my patch already
>    did). We shouldn't aim for a clever BKL, we should aim for a BKL
>    that nobody uses.
>
> I'm certainly open to anything. Regardless, we should decide fairly soon,
> so that we have the choice made before -rc2 is out, and not drag this out,
> since regardless of the choice it needs to be tested and people comfy with
> it for the 2.6.26 release.
>
> 		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08  4:08                   ` Zhang, Yanmin
@ 2008-05-08  4:17                     ` Linus Torvalds
  2008-05-08 12:01                       ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08  4:17 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Andi Kleen, Matthew Wilcox, Ingo Molnar, LKML, Alexander Viro,
	Andrew Morton

On Thu, 8 May 2008, Zhang, Yanmin wrote:
>
> On Wed, 2008-05-07 at 20:29 -0700, Linus Torvalds wrote:
> >
> > That said, "idle 0%" is easy when you spin. Do you also have actual
> > performance numbers?
>
> Yes. My conclusion is based on the actual number. cpu idle 0% is just
> a behavior it should be.

Thanks, that's all I wanted to verify.

I'll leave this overnight, and see if somebody has come up with some
smart and wonderful patch. And if not, I think I'll apply mine as
"known to fix a regression", and we can perhaps then improve on things
further from there.

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread
* [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08  4:17                     ` Linus Torvalds
@ 2008-05-08 12:01                       ` Ingo Molnar
  2008-05-08 12:28                         ` Ingo Molnar
                                          ` (2 more replies)
  0 siblings, 3 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 12:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > > That said, "idle 0%" is easy when you spin. Do you also have
> > > actual performance numbers?
> >
> > Yes. My conclusion is based on the actual number. cpu idle 0% is
> > just a behavior it should be.
>
> Thanks, that's all I wanted to verify.
>
> I'll leave this overnight, and see if somebody has come up with some
> smart and wonderful patch. And if not, I think I'll apply mine as
> "known to fix a regression", and we can perhaps then improve on things
> further from there.

hey, i happen to have such a smart and wonderful patch =B-)

i reproduced the AIM7 workload and can confirm Yanmin's findings that
v2.6.26-rc1 regresses over v2.6.25 - by over 67% here.

Looking at the workload i found and fixed what i believe to be the real
bug causing the AIM7 regression: it was inefficient wakeup / scheduling
/ locking behavior of the new generic semaphore code, causing
suboptimal performance.

The problem comes from the following code.
The new semaphore code does this on down():

        spin_lock_irqsave(&sem->lock, flags);
        if (likely(sem->count > 0))
                sem->count--;
        else
                __down(sem);
        spin_unlock_irqrestore(&sem->lock, flags);

and this on up():

        spin_lock_irqsave(&sem->lock, flags);
        if (likely(list_empty(&sem->wait_list)))
                sem->count++;
        else
                __up(sem);
        spin_unlock_irqrestore(&sem->lock, flags);

where __up() does:

        list_del(&waiter->list);
        waiter->up = 1;
        wake_up_process(waiter->task);

and where __down() does this in essence:

        list_add_tail(&waiter.list, &sem->wait_list);
        waiter.task = task;
        waiter.up = 0;
        for (;;) {
                [...]
                spin_unlock_irq(&sem->lock);
                timeout = schedule_timeout(timeout);
                spin_lock_irq(&sem->lock);
                if (waiter.up)
                        return 0;
        }

the fastpath looks good and obvious, but note the following property of
the contended path: if there's a task on the ->wait_list, the up() of
the current owner will "pass over" ownership to that waiting task, in a
wake-one manner, via the waiter->up flag and by removing the waiter
from the wait list.

That is all fine in principle, but as implemented in kernel/semaphore.c
it also creates a nasty, hidden source of contention!

The contention comes from the following property of the new semaphore
code: the new owner owns the semaphore exclusively, even if it is not
running yet. So if the old owner, even if just a few instructions
later, does a down() [lock_kernel()] again, it will be blocked and will
have to wait on the new owner to eventually be scheduled (possibly on
another CPU)! Or if another task gets to lock_kernel() sooner than the
"new owner" is scheduled, it will be blocked unnecessarily and for a
very long time when there are 2000 tasks running.

I.e. the implementation of the new semaphores code does wake-one and
lock ownership in a very restrictive way - it does not allow
opportunistic re-locking of the lock at all and keeps the scheduler
from picking task order intelligently.
This kind of scheduling, with 2000 AIM7 processes running, creates
awful cross-scheduling between those 2000 tasks, causes reduced
parallelism, a throttled runqueue length and a lot of idle time. With
an increasing number of CPUs it causes exponentially worse behavior in
AIM7, as the chance for a newly woken new-owner task to actually run
anytime soon is less and less likely.

Note that it takes just a tiny bit of contention for the
'new-semaphore catastrophe' to happen: the wakeup latencies get added
to whatever small contention there is, and quickly snowball out of
control!

I believe Yanmin's findings and numbers support this analysis too.

The best fix for this problem is to use the same scheduling logic that
the kernel/mutex.c code uses: keep the wake-one behavior (that is OK
and wanted because we do not want to over-schedule), but also allow
opportunistic locking of the lock even if a wakee is already "in
flight".

The patch below implements this new logic. With this patch applied the
AIM7 regression is largely fixed on my quad testbox:

  # v2.6.25 vanilla:
  ..................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    56096.4         91      207.5   789.7   0.4675
  2000    55894.4         94      208.2   792.7   0.4658

  # v2.6.26-rc1-166-gc0a1811 vanilla:
  ...................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    33230.6         83      350.3   784.5   0.2769
  2000    31778.1         86      366.3   783.6   0.2648

  # v2.6.26-rc1-166-gc0a1811 + semaphore-speedup:
  ...............................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55707.1         92      209.0   795.6   0.4642
  2000    55704.4         96      209.0   796.0   0.4642

i.e. a 67% speedup. We are now back to within 1% of the v2.6.25
performance levels and have zero idle time during the test, as
expected.

Btw., interactivity also improved dramatically with the fix - for
example console-switching became almost instantaneous during this
workload (which after all is running 2000 tasks at once!), without the
patch it was stuck for a minute at times.
I also ran Linus's spinlock-BKL patch as well:

  # v2.6.26-rc1-166-gc0a1811 + Linus-BKL-spinlock-patch:
  ......................................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55889.0         92      208.3   793.3   0.4657
  2000    55891.7         96      208.3   793.3   0.4658

it is about 0.3% faster - but note that that is within the general
noise levels of this test.

I'd expect Linus's spinlock-BKL patch to give some small speedup,
because the BKL acquire times are short and 2000 tasks running all at
once really increases the context-switching cost, and most BKL
contentions are within the cost of context-switching. But i believe the
better solution for that is to remove BKL use from all hotpaths, not to
hide some of its costs by reintroducing it as a spinlock.

Reintroducing the spinlock-based BKL would have other disadvantages as
well: it could reintroduce per-CPU-ness assumptions in BKL-using code
and other complications.

It's also not a very realistic workload - with 2000 tasks running the
system was barely serviceable. I'd much rather make BKL costs more
apparent and more visible - but a 50% regression was of course too
much. But 0.3% for a 2000-tasks workload, which is near the noise
level ... is acceptable i think - especially as this discussion has now
reinvigorated the remove-the-BKL discussions and patches.

Linus, we can do your spinlock-BKL patch too if you feel strongly about
it, but i'd rather not - we fought so hard for the preemptible BKL :-)

The spinlock-based-BKL patch only worked around the real problem i
believe, because it eliminated the use of the suboptimal new semaphore
code: with spinlocks there's no scheduling at all, so the wakeup/locking
bug of the new semaphore code did not apply. It was not about any
fastpath overhead AFAICS. [we'd have seen that with the
CONFIG_PREEMPT_BKL=y code as well, which has been the default setting
since v2.6.8.]
There's another nice side-effect of this speedup patch: the new generic semaphore code got even smaller:

   text    data     bss     dec     hex filename
   1241       0       0    1241     4d9 semaphore.o.before
   1207       0       0    1207     4b7 semaphore.o.after

(because the waiter.up complication got removed.)

Longer-term we should look into using the mutex code for the generic semaphore code as well - but it's not easy due to legacies, and it's outside the scope of v2.6.26 and outside the scope of this patch as well.

Hm?

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c | 36 ++++++++++++++++--------------------
 1 file changed, 16 insertions(+), 20 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		__down(sem);
+	sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
 EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		result = __down_interruptible(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		result = __down_killable(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem,
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		result = __down_timeout(sem, jiffies);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -179,9 +178,8 @@ void up(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(list_empty(&sem->wait_list)))
-		sem->count++;
-	else
+	sem->count++;
+	if (unlikely(!list_empty(&sem->wait_list)))
 		__up(sem);
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up);
 struct semaphore_waiter {
 	struct list_head list;
 	struct task_struct *task;
-	int up;
 };
 
 /*
@@ -206,11 +203,11 @@ static inline int __sched __down_common(
 	struct task_struct *task = current;
 	struct semaphore_waiter waiter;
 
-	list_add_tail(&waiter.list, &sem->wait_list);
 	waiter.task = task;
-	waiter.up = 0;
 
 	for (;;) {
+		list_add_tail(&waiter.list, &sem->wait_list);
+
 		if (state == TASK_INTERRUPTIBLE && signal_pending(task))
 			goto interrupted;
 		if (state == TASK_KILLABLE && fatal_signal_pending(task))
@@ -221,7 +218,7 @@ static inline int __sched __down_common(
 		spin_unlock_irq(&sem->lock);
 		timeout = schedule_timeout(timeout);
 		spin_lock_irq(&sem->lock);
-		if (waiter.up)
+		if (sem->count > 0)
 			return 0;
 	}
 
@@ -259,6 +256,5 @@ static noinline void __sched __up(struct
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 						struct semaphore_waiter, list);
 	list_del(&waiter->list);
-	waiter->up = 1;
 	wake_up_process(waiter->task);
 }
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 12:01 ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Ingo Molnar
@ 2008-05-08 12:28   ` Ingo Molnar
  2008-05-08 14:43     ` Ingo Molnar
  2008-05-08 16:02     ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Linus Torvalds
  2008-05-08 13:20   ` Matthew Wilcox
  2008-05-08 13:56   ` Arjan van de Ven
  2 siblings, 2 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 12:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Ingo Molnar <mingo@elte.hu> wrote:

> +	if (unlikely(sem->count <= 0))
> 		__down(sem);
> +	sem->count--;

Peter pointed out that because sem->count is u32, the <= 0 test is in fact an "== 0" condition - the patch below does that. As expected, gcc figured out the same thing too, so the resulting code output did not change. (so this is just a cleanup)

i've got this lined up in sched.git and it's undergoing testing right now. If that testing goes fine and if there are no objections i'll send a pull request for it later today.
	Ingo

---------------->
Subject: semaphores: improve code
From: Ingo Molnar <mingo@elte.hu>
Date: Thu May 08 14:19:23 CEST 2008

No code changed:

kernel/semaphore.o:

   text    data     bss     dec     hex filename
   1207       0       0    1207     4b7 semaphore.o.before
   1207       0       0    1207     4b7 semaphore.o.after

md5:
   c10198c2952bd345a1edaac6db891548  semaphore.o.before.asm
   c10198c2952bd345a1edaac6db891548  semaphore.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -54,7 +54,7 @@ void down(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (unlikely(sem->count <= 0))
+	if (unlikely(!sem->count))
 		__down(sem);
 	sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
@@ -76,7 +76,7 @@ int down_interruptible(struct semaphore
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (unlikely(sem->count <= 0))
+	if (unlikely(!sem->count))
 		result = __down_interruptible(sem);
 	if (!result)
 		sem->count--;
@@ -102,7 +102,7 @@ int down_killable(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (unlikely(sem->count <= 0))
+	if (unlikely(!sem->count))
 		result = __down_killable(sem);
 	if (!result)
 		sem->count--;
@@ -156,7 +156,7 @@ int down_timeout(struct semaphore *sem,
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (unlikely(sem->count <= 0))
+	if (unlikely(!sem->count))
 		result = __down_timeout(sem, jiffies);
 	if (!result)
 		sem->count--;
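As a standalone illustration (plain userspace C, not kernel code) of why this is only a cleanup: an unsigned value has no negative cases, so `x <= 0` can only hold when `x == 0`, and the compiler emits the same test for both spellings.

```c
#include <assert.h>

/* Two spellings of the same predicate for an unsigned operand:
 * since sem->count is u32, "count <= 0" has no negative cases left
 * and collapses to "count == 0" -- no generated code changes. */
static int le_zero(unsigned int x) { return x <= 0; }
static int is_zero(unsigned int x) { return !x; }
```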
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 12:28 ` Ingo Molnar
@ 2008-05-08 14:43   ` Ingo Molnar
  2008-05-08 15:10     ` [git pull] scheduler fixes Ingo Molnar
  2008-05-08 16:02   ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Linus Torvalds
  1 sibling, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 14:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Ingo Molnar <mingo@elte.hu> wrote:

> Peter pointed out that because sem->count is u32, the <= 0 is in
> fact a "== 0" condition - the patch below does that. As expected gcc
> figured out the same thing too so the resulting code output did not
> change. (so this is just a cleanup)

a second update patch: i've further simplified the semaphore wakeup logic: there's no need for the wakeup to remove the task from the wait list. This will make them slightly more fair, but more importantly, this closes a race in my first patch for the unlikely case of a signal (or a timeout) and an unlock coming in at the same time and the task not getting removed from the wait-list.

( my performance testing with 2000 AIM7 tasks on a quad never hit that race, but x86.git QA actually triggered it after about 30 random kernel bootups and it caused a nasty crash and lockup.
 )

	Ingo

---------------->
Subject: sem: simplify queue management
From: Ingo Molnar <mingo@elte.hu>
Date: Tue May 06 19:32:42 CEST 2008

kernel/semaphore.o:

   text    data     bss     dec     hex filename
   1040       0       0    1040     410 semaphore.o.before
    975       0       0     975     3cf semaphore.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -202,33 +202,34 @@ static inline int __sched __down_common(
 {
 	struct task_struct *task = current;
 	struct semaphore_waiter waiter;
+	int ret = 0;
 
 	waiter.task = task;
+	list_add_tail(&waiter.list, &sem->wait_list);
 
 	for (;;) {
-		list_add_tail(&waiter.list, &sem->wait_list);
-
-		if (state == TASK_INTERRUPTIBLE && signal_pending(task))
-			goto interrupted;
-		if (state == TASK_KILLABLE && fatal_signal_pending(task))
-			goto interrupted;
-		if (timeout <= 0)
-			goto timed_out;
+		if (state == TASK_INTERRUPTIBLE && signal_pending(task)) {
+			ret = -EINTR;
+			break;
+		}
+		if (state == TASK_KILLABLE && fatal_signal_pending(task)) {
+			ret = -EINTR;
+			break;
+		}
+		if (timeout <= 0) {
+			ret = -ETIME;
+			break;
+		}
 
 		__set_task_state(task, state);
 		spin_unlock_irq(&sem->lock);
 		timeout = schedule_timeout(timeout);
 		spin_lock_irq(&sem->lock);
 		if (sem->count > 0)
-			return 0;
+			break;
 	}
 
- timed_out:
-	list_del(&waiter.list);
-	return -ETIME;
-
- interrupted:
 	list_del(&waiter.list);
-	return -EINTR;
+	return ret;
 }
 
 static noinline void __sched __down(struct semaphore *sem)
@@ -255,6 +256,5 @@ static noinline void __sched __up(struct
 {
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 						struct semaphore_waiter, list);
-	list_del(&waiter->list);
 	wake_up_process(waiter->task);
 }
* [git pull] scheduler fixes
  2008-05-08 14:43 ` Ingo Molnar
@ 2008-05-08 15:10   ` Ingo Molnar
  2008-05-08 15:33     ` Adrian Bunk
  2008-05-11 11:03     ` Matthew Wilcox
  0 siblings, 2 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 15:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Ingo Molnar <mingo@elte.hu> wrote:

> > Peter pointed out that because sem->count is u32, the <= 0 is in
> > fact a "== 0" condition - the patch below does that. As expected gcc
> > figured out the same thing too so the resulting code output did not
> > change. (so this is just a cleanup)
>
> a second update patch: i've further simplified the semaphore wakeup
> logic: there's no need for the wakeup to remove the task from the wait
> list. This will make them slightly more fair, but more importantly,
> this closes a race in my first patch for the unlikely case of a signal
> (or a timeout) and an unlock coming in at the same time and the task
> not getting removed from the wait-list.
>
> ( my performance testing with 2000 AIM7 tasks on a quad never hit that
>   race, but x86.git QA actually triggered it after about 30 random
>   kernel bootups and it caused a nasty crash and lockup. )

ok, it's looking good here so far, so here's the scheduler fixes tree that you can pull if my semaphore fix looks good to you too:

  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes.git for-linus

it also includes a scheduler arithmetics fix from Mike. Find the shortlog and diff below.
	Ingo

------------------>
Ingo Molnar (1):
      semaphore: fix

Mike Galbraith (1):
      sched: fix weight calculations

 kernel/sched_fair.c |   11 ++++++--
 kernel/semaphore.c  |   64 ++++++++++++++++++++++++---------------------------
 2 files changed, 38 insertions(+), 37 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c863663..e24ecd3 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -662,10 +662,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	if (!initial) {
 		/* sleeps upto a single latency don't count. */
 		if (sched_feat(NEW_FAIR_SLEEPERS)) {
+			unsigned long thresh = sysctl_sched_latency;
+
+			/*
+			 * convert the sleeper threshold into virtual time
+			 */
 			if (sched_feat(NORMALIZED_SLEEPER))
-				vruntime -= calc_delta_weight(sysctl_sched_latency, se);
-			else
-				vruntime -= sysctl_sched_latency;
+				thresh = calc_delta_fair(thresh, se);
+
+			vruntime -= thresh;
 		}
 
 		/* ensure we never gain time by being placed backwards. */
diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5c2942e..5e41217 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(!sem->count))
 		__down(sem);
+	sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
 EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(!sem->count))
 		result = __down_interruptible(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(!sem->count))
 		result = __down_killable(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem, long jiffies)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(!sem->count))
 		result = __down_timeout(sem, jiffies);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -179,9 +178,8 @@ void up(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(list_empty(&sem->wait_list)))
-		sem->count++;
-	else
+	sem->count++;
+	if (unlikely(!list_empty(&sem->wait_list)))
 		__up(sem);
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up);
 struct semaphore_waiter {
 	struct list_head list;
 	struct task_struct *task;
-	int up;
 };
 
 /*
@@ -205,33 +202,34 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
 {
 	struct task_struct *task = current;
 	struct semaphore_waiter waiter;
+	int ret = 0;
 
-	list_add_tail(&waiter.list, &sem->wait_list);
 	waiter.task = task;
-	waiter.up = 0;
+	list_add_tail(&waiter.list, &sem->wait_list);
 
 	for (;;) {
-		if (state == TASK_INTERRUPTIBLE && signal_pending(task))
-			goto interrupted;
-		if (state == TASK_KILLABLE && fatal_signal_pending(task))
-			goto interrupted;
-		if (timeout <= 0)
-			goto timed_out;
+		if (state == TASK_INTERRUPTIBLE && signal_pending(task)) {
+			ret = -EINTR;
+			break;
+		}
+		if (state == TASK_KILLABLE && fatal_signal_pending(task)) {
+			ret = -EINTR;
+			break;
+		}
+		if (timeout <= 0) {
+			ret = -ETIME;
+			break;
+		}
 
 		__set_task_state(task, state);
 		spin_unlock_irq(&sem->lock);
 		timeout = schedule_timeout(timeout);
 		spin_lock_irq(&sem->lock);
-		if (waiter.up)
-			return 0;
+		if (sem->count > 0)
+			break;
 	}
 
- timed_out:
-	list_del(&waiter.list);
-	return -ETIME;
-
- interrupted:
 	list_del(&waiter.list);
-	return -EINTR;
+	return ret;
 }
 
 static noinline void __sched __down(struct semaphore *sem)
@@ -258,7 +256,5 @@ static noinline void __sched __up(struct semaphore *sem)
 {
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 						struct semaphore_waiter, list);
-	list_del(&waiter->list);
-	waiter->up = 1;
 	wake_up_process(waiter->task);
 }
* Re: [git pull] scheduler fixes
  2008-05-08 15:10 ` [git pull] scheduler fixes Ingo Molnar
@ 2008-05-08 15:33   ` Adrian Bunk
  2008-05-08 15:41     ` Ingo Molnar
  2008-05-11 11:03   ` Matthew Wilcox
  1 sibling, 1 reply; 140+ messages in thread
From: Adrian Bunk @ 2008-05-08 15:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Mike Galbraith

On Thu, May 08, 2008 at 05:10:28PM +0200, Ingo Molnar wrote:
>...
> also includes a scheduler arithmetics fix from Mike. Find the shortlog
> and diff below.
>
> 	Ingo
>
> ------------------>
>...
> Mike Galbraith (1):
>       sched: fix weight calculations
>...

The commit description says:

<--  snip  -->

...
This bug could be related to the regression reported by Yanmin Zhang:

| Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots
| of regressions with 2.6.26-rc1:
|
| 1) 8-core stoakley: 28%;
| 2) 16-core tigerton: 20%;
| 3) Itanium Montvale: 50%.

Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
...

<--  snip  -->

Can we get that verified and the description updated before it hits Linus' tree?

Otherwise this "could be related" will become unchangeable metadata that will stay forever - no matter whether there's any relation at all.

Thanks
Adrian

-- 
"Is there not promise of rain?" Ling Tan asked suddenly out
 of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
                                   Pearl S. Buck - Dragon Seed
* Re: [git pull] scheduler fixes
  2008-05-08 15:33 ` Adrian Bunk
@ 2008-05-08 15:41   ` Ingo Molnar
  2008-05-08 19:42     ` Adrian Bunk
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 15:41 UTC (permalink / raw)
To: Adrian Bunk
Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Mike Galbraith

* Adrian Bunk <bunk@kernel.org> wrote:

> Can we get that verified and the description updated before it hits
> Linus' tree?

that's not needed. Mike's fix is correct, regardless of whether it fixes the other regression or not.

> Otherwise this "could be related" will become unchangeable metadata
> that will stay forever - no matter whether there's any relation at
> all.

... and the problem with that is exactly what?

	Ingo
* Re: [git pull] scheduler fixes
  2008-05-08 15:41 ` Ingo Molnar
@ 2008-05-08 19:42   ` Adrian Bunk
  0 siblings, 0 replies; 140+ messages in thread
From: Adrian Bunk @ 2008-05-08 19:42 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Mike Galbraith

On Thu, May 08, 2008 at 05:41:16PM +0200, Ingo Molnar wrote:
>
> * Adrian Bunk <bunk@kernel.org> wrote:
>
> > Can we get that verified and the description updated before it hits
> > Linus' tree?
>
> that's not needed. Mike's fix is correct, regardless of whether it fixes
> the other regression or not.

Then scrap the part about it possibly fixing a regression, and the Reported-by: line.

> > Otherwise this "could be related" will become unchangeable metadata
> > that will stay forever - no matter whether there's any relation at
> > all.
>
> ... and the problem with that is exactly what?

It is important that our metadata is as complete and correct as reasonably possible. Our code is not as well documented as it should be, and in my experience often the only way to understand what happens and why it happens is to ask git for the metadata (I'm actually doing this even for most of my "trivial" patches).

In 3 hours or 3 years someone might look at this commit trying to understand what it does and why it does it. And there's a big difference between "we do it because it's correct from a theoretical point of view" and "it is supposed to fix a huge performance regression".

> Ingo

cu
Adrian

-- 
"Is there not promise of rain?" Ling Tan asked suddenly out
 of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
                                   Pearl S. Buck - Dragon Seed
* Re: [git pull] scheduler fixes
  2008-05-08 15:10 ` [git pull] scheduler fixes Ingo Molnar
  2008-05-08 15:33   ` Adrian Bunk
@ 2008-05-11 11:03   ` Matthew Wilcox
  2008-05-11 11:14     ` Matthew Wilcox
                       ` (3 more replies)
  1 sibling, 4 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 11:03 UTC (permalink / raw)
To: Ingo Molnar, Sven Wegener
Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Thu, May 08, 2008 at 05:10:28PM +0200, Ingo Molnar wrote:
> @@ -258,7 +256,5 @@ static noinline void __sched __up(struct semaphore *sem)
>  {
>  	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
>  						struct semaphore_waiter, list);
> -	list_del(&waiter->list);
> -	waiter->up = 1;
>  	wake_up_process(waiter->task);
>  }

This might be the problem that causes the missing wakeups. If you have a semaphore with n=2 and four processes calling down(), tasks A and B acquire the semaphore and tasks C and D go to sleep. Task A calls up() and wakes up C. Then task B calls up() and doesn't wake up anyone, because C hasn't run yet. I think we need another wakeup when task C finishes in __down_common, like this (on top of your patch):

diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5e41217..e520ad4 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -229,6 +229,11 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
 	}
 
 	list_del(&waiter.list);
+
+	/* It's possible we need to wake up the next task on the list too */
+	if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list))
+		__up(sem);
+
 	return ret;
 }

Sven, can you try this with your workload? I suspect this might be it, because XFS does use semaphores with n>1.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
* Re: [git pull] scheduler fixes
  2008-05-11 11:03 ` Matthew Wilcox
@ 2008-05-11 11:14   ` Matthew Wilcox
  2008-05-11 11:48   ` Matthew Wilcox
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 11:14 UTC (permalink / raw)
To: Ingo Molnar, Sven Wegener
Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, May 11, 2008 at 05:03:06AM -0600, Matthew Wilcox wrote:
> This might be the problem that causes the missing wakeups. If you have a
> semaphore with n=2, and four processes calling down(), tasks A and B
> acquire the semaphore and tasks C and D go to sleep. Task A calls up()
> and wakes up C. Then task B calls up() and doesn't wake up anyone
> because C hasn't run yet. I think we need another wakeup when task C

Er, I mis-wrote there. Task A calls up() and wakes up C. Then task B calls up() and wakes up C again, because C hasn't removed itself from the list yet. D never receives a wakeup. The solution is for C to pass a wakeup along to the next in line. (The solution remains the same.)

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
* Re: [git pull] scheduler fixes
  2008-05-11 11:03 ` Matthew Wilcox
  2008-05-11 11:14   ` Matthew Wilcox
@ 2008-05-11 11:48   ` Matthew Wilcox
  2008-05-11 12:50     ` Ingo Molnar
  2008-05-11 13:01   ` Ingo Molnar
  2008-05-11 14:10   ` Sven Wegener
  3 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 11:48 UTC (permalink / raw)
To: Ingo Molnar, Sven Wegener
Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, May 11, 2008 at 05:03:06AM -0600, Matthew Wilcox wrote:
> This might be the problem that causes the missing wakeups. If you have a
> semaphore with n=2, and four processes calling down(), tasks A and B
> acquire the semaphore and tasks C and D go to sleep. Task A calls up()
[...]
> Sven, can you try this with your workload? I suspect this might be it
> because XFS does use semaphores with n>1.

This is exactly it. Or rather, it's even simpler. Three tasks are involved: A and B call xlog_state_get_iclog_space() and end up calling psema(&log->l_flushsema) [down() by any other name - the semaphore is initialised to 0; it's really a completion]. Then task C calls xlog_state_do_callback(), which does:

	while (flushcnt--)
		vsema(&log->l_flushsema);

[vsema is AKA up()]

It assumes this wakes up both A and B ... but it won't with Ingo's code; it'll wake up A twice.

So I deem my fix "proven by thought experiment". I haven't tried booting it or anything.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
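The lost wakeup described above is easy to replay deterministically in userspace (a sketch with invented names, not the kernel code): marking an already-runnable task runnable again is a no-op, just like wake_up_process(), so two up()s can both land on the same head waiter, and the second sleeper only hears about its count if the woken task passes a wakeup along, as in Matthew's patch.

```c
#include <assert.h>

/* Replay of the lost-wakeup scenario: a semaphore at 0 with two
 * sleepers (tasks 0 and 1), then two up()s in a row before either
 * sleeper gets a chance to run. */
enum { NTASKS = 2 };

struct sim {
	unsigned int count;
	int queue[NTASKS];	/* sleeping task ids, FIFO */
	int nq;
	int runnable[NTASKS];	/* wake_up_process(): an idempotent flag */
};

static void up(struct sim *s)
{
	s->count++;
	if (s->nq)
		s->runnable[s->queue[0]] = 1;	/* always the head: the bug */
}

/* the woken head leaves the queue and takes one count; with 'passalong'
 * set it forwards a wakeup when more counts remain (Matthew's fix) */
static void run_woken(struct sim *s, int task, int passalong)
{
	int i;
	assert(s->nq > 0 && s->queue[0] == task && s->runnable[task]);
	for (i = 1; i < s->nq; i++)
		s->queue[i - 1] = s->queue[i];
	s->nq--;
	if (passalong && s->count > 1 && s->nq)
		s->runnable[s->queue[0]] = 1;
	s->count--;
}
```

Without the pass-along, one count sits unused while the second task sleeps forever - the XFS hang Sven hit; with it, the surplus wakeup cascades down the list.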
* Re: [git pull] scheduler fixes
  2008-05-11 11:48 ` Matthew Wilcox
@ 2008-05-11 12:50   ` Ingo Molnar
  2008-05-11 12:52     ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-11 12:50 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Matthew Wilcox <matthew@wil.cx> wrote:

> So I deem my fix "proven by thought experiment". I haven't tried
> booting it or anything.

i actually have two fixes, made earlier today. The 'fix3' one has been confirmed by Sven to fix the regression - but i think we need the 'fix #2' one below as well to make it complete.

	Ingo

-------------------->
Subject: semaphore: fix3
From: Ingo Molnar <mingo@elte.hu>
Date: Sun May 11 09:51:07 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c | 7 +++++++
 1 file changed, 7 insertions(+)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -258,5 +258,12 @@ static noinline void __sched __up(struct
 {
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 						struct semaphore_waiter, list);
+
+	/*
+	 * Rotate sleepers - to make sure all of them get woken in case
+	 * of parallel up()s:
+	 */
+	list_move_tail(&waiter->list, &sem->wait_list);
+
 	wake_up_process(waiter->task);
 }

---------------->
Subject: semaphore: fix #2
From: Ingo Molnar <mingo@elte.hu>
Date: Thu May 08 11:53:48 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -194,6 +194,13 @@ struct semaphore_waiter {
 	struct task_struct *task;
 };
 
+static noinline void __sched __up(struct semaphore *sem)
+{
+	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
+						struct semaphore_waiter, list);
+	wake_up_process(waiter->task);
+}
+
 /*
  * Because this function is inlined, the 'state' parameter will be
  * constant, and thus optimised away by the compiler.  Likewise the
@@ -231,6 +238,9 @@ static inline int __sched __down_common(
 	}
 
 	list_del(&waiter.list);
+	if (unlikely(!list_empty(&sem->wait_list)) && sem->count)
+		__up(sem);
+
 	return ret;
 }
 
@@ -254,9 +264,3 @@ static noinline int __sched __down_timeo
 	return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies);
 }
 
-static noinline void __sched __up(struct semaphore *sem)
-{
-	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
-						struct semaphore_waiter, list);
-	wake_up_process(waiter->task);
-}
* Re: [git pull] scheduler fixes
  2008-05-11 12:50 ` Ingo Molnar
@ 2008-05-11 12:52   ` Ingo Molnar
  2008-05-11 13:02     ` Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-11 12:52 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Ingo Molnar <mingo@elte.hu> wrote:

> > So I deem my fix "proven by thought experiment". I haven't tried
> > booting it or anything.
>
> i actually have two fixes, made earlier today. The 'fix3' one has been
> confirmed by Sven to fix the regression - but i think we need the 'fix
> #2' one below as well to make it complete.

i just combined them into a single fix, see below.

	Ingo

--------------------------->
Subject: semaphore: fix #3
From: Ingo Molnar <mingo@elte.hu>
Date: Sun May 11 09:51:07 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -194,6 +194,13 @@ struct semaphore_waiter {
 	struct task_struct *task;
 };
 
+static noinline void __sched __up(struct semaphore *sem)
+{
+	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
+						struct semaphore_waiter, list);
+	wake_up_process(waiter->task);
+}
+
 /*
  * Because this function is inlined, the 'state' parameter will be
  * constant, and thus optimised away by the compiler.  Likewise the
@@ -231,6 +238,9 @@ static inline int __sched __down_common(
 	}
 
 	list_del(&waiter.list);
+	if (unlikely(!list_empty(&sem->wait_list)) && sem->count)
+		__up(sem);
+
 	return ret;
 }
 
@@ -254,9 +264,10 @@ static noinline int __sched __down_timeo
 	return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies);
 }
 
-static noinline void __sched __up(struct semaphore *sem)
-{
-	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
-						struct semaphore_waiter, list);
-	wake_up_process(waiter->task);
-}
+
+	/*
+	 * Rotate sleepers - to make sure all of them get woken in case
+	 * of parallel up()s:
+	 */
+	list_move_tail(&waiter->list, &sem->wait_list);
+
* Re: [git pull] scheduler fixes
  2008-05-11 12:52 ` Ingo Molnar
@ 2008-05-11 13:02   ` Matthew Wilcox
  2008-05-11 13:26     ` Matthew Wilcox
  2008-05-11 13:54     ` Ingo Molnar
  0 siblings, 2 replies; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 13:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
	Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, May 11, 2008 at 02:52:16PM +0200, Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > > So I deem my fix "proven by thought experiment". I haven't tried
> > > booting it or anything.
> >
> > i actually have two fixes, made earlier today. The 'fix3' one has been
> > confirmed by Sven to fix the regression - but i think we need the 'fix
> > #2' one below as well to make it complete.
>
> i just combined them into a single fix, see below.

That's mangled ... why did you move __up around?

> 	list_del(&waiter.list);
> +	if (unlikely(!list_empty(&sem->wait_list)) && sem->count)
> +		__up(sem);

That's an unnecessary wakeup compared to my patch.

> 	return ret;
>  }
>
> @@ -254,9 +264,10 @@ static noinline int __sched __down_timeo
> 	return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies);
>  }
>
> -static noinline void __sched __up(struct semaphore *sem)
> -{
> -	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
> -						struct semaphore_waiter, list);
> -	wake_up_process(waiter->task);
> -}
> +
> +	/*
> +	 * Rotate sleepers - to make sure all of them get woken in case
> +	 * of parallel up()s:
> +	 */
> +	list_move_tail(&waiter->list, &sem->wait_list);

Seems like extra cache line dirtying for no real gain over my solution.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
* Re: [git pull] scheduler fixes 2008-05-11 13:02 ` Matthew Wilcox @ 2008-05-11 13:26 ` Matthew Wilcox 2008-05-11 14:00 ` Ingo Molnar 2008-05-11 13:54 ` Ingo Molnar 1 sibling, 1 reply; 140+ messages in thread From: Matthew Wilcox @ 2008-05-11 13:26 UTC (permalink / raw) To: Ingo Molnar Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin On Sun, May 11, 2008 at 07:02:26AM -0600, Matthew Wilcox wrote: > > + list_move_tail(&waiter->list, &sem->wait_list); > > Seems like extra cache line dirtying for no real gain over my solution. Actually, let me just go into this a little further. In principle, you'd think that we'd want to wake up all the tasks possible as soon as possible. In practice, Dave Chinner has said that the l_flushsema introduces a thundering herd (a few hundred tasks can build up behind it on systems such as Columbia apparently) that then run into a bottleneck as soon as they're unleashed. Current XFS CVS has a fix from myself and Christoph that gets rid of the l_flushsema and replaces it with a staggered wakeup of each task that's waiting as the previously woken task clears the critical section. Obviously, generic up() can't possibly do as well, but by staggering the release of tasks from __down_common(), we mitigate the herd somewhat. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 13:26 ` Matthew Wilcox @ 2008-05-11 14:00 ` Ingo Molnar 2008-05-11 14:18 ` Matthew Wilcox 0 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-11 14:00 UTC (permalink / raw) To: Matthew Wilcox Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin * Matthew Wilcox <matthew@wil.cx> wrote: > Current XFS CVS has a fix from myself and Christoph that gets rid of > the l_flushsema and replaces it with a staggered wakeup of each task > that's waiting as the previously woken task clears the critical > section. the solution is to reduce semaphore usage by converting them to mutexes. Is anyone working on removing legacy semaphore use from XFS? Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 14:00 ` Ingo Molnar @ 2008-05-11 14:18 ` Matthew Wilcox 2008-05-11 14:42 ` Ingo Molnar 0 siblings, 1 reply; 140+ messages in thread From: Matthew Wilcox @ 2008-05-11 14:18 UTC (permalink / raw) To: Ingo Molnar Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin On Sun, May 11, 2008 at 04:00:17PM +0200, Ingo Molnar wrote: > * Matthew Wilcox <matthew@wil.cx> wrote: > > > Current XFS CVS has a fix from myself and Christoph that gets rid of > > the l_flushsema and replaces it with a staggered wakeup of each task > > that's waiting as the previously woken task clears the critical > > section. > > the solution is to reduce semaphore usage by converting them to mutexes. > Is anyone working on removing legacy semaphore use from XFS? This race is completely irrelevant to converting semaphores to mutexes. It can only occur for semaphores which /can't/ be converted to mutexes. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 14:18 ` Matthew Wilcox @ 2008-05-11 14:42 ` Ingo Molnar 2008-05-11 14:48 ` Matthew Wilcox 0 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-11 14:42 UTC (permalink / raw) To: Matthew Wilcox Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra * Matthew Wilcox <matthew@wil.cx> wrote: > > > Current XFS CVS has a fix from myself and Christoph that gets rid > > > of the l_flushsema and replaces it with a staggered wakeup of each > > > task that's waiting as the previously woken task clears the > > > critical section. > > > > the solution is to reduce semaphore usage by converting them to > > mutexes. Is anyone working on removing legacy semaphore use from > > XFS? > > This race is completely irrelevant to converting semaphores to > mutexes. [...] i was not talking about the race. I was just reacting on your comments about thundering herds and staggered wakeups - which is a performance detail. Semaphores should not regress AIM7 by 50% but otherwise they are legacy code and their use should be reduced monotonically, so i was asking why anyone still cares about tuning semaphore details in XFS instead of just working on removing semaphore use from them. > [...] It can only occur for semaphores which /can't/ be converted to > mutexes. exactly what usecase is that? Perhaps it could be converted to an atomic counter + the wait_event() APIs. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 14:42 ` Ingo Molnar @ 2008-05-11 14:48 ` Matthew Wilcox 2008-05-11 15:19 ` Ingo Molnar 0 siblings, 1 reply; 140+ messages in thread From: Matthew Wilcox @ 2008-05-11 14:48 UTC (permalink / raw) To: Ingo Molnar Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra On Sun, May 11, 2008 at 04:42:03PM +0200, Ingo Molnar wrote: > > * Matthew Wilcox <matthew@wil.cx> wrote: > > > > > Current XFS CVS has a fix from myself and Christoph that gets rid > > > > of the l_flushsema and replaces it with a staggered wakeup of each > > > > task that's waiting as the previously woken task clears the > > > > critical section. > > i was not talking about the race. I was just reacting on your comments > about thundering herds and staggered wakeups - which is a performance > detail. Semaphores should not regress AIM7 by 50% but otherwise they are > legacy code and their use should be reduced monotonically, so i was > asking why anyone still cares about tuning semaphore details in XFS > instead of just working on removing semaphore use from them. That's what we did. l_flushsema is gone. It actually got replaced with a condition variable, but it's equivalent to a wait_event() > exactly what usecase is that? Perhaps it could be converted to an atomic > counter + the wait_event() APIs. Effectively, it's a completion. It just works better with staggered wakeups than it does with the naive completion. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 14:48 ` Matthew Wilcox @ 2008-05-11 15:19 ` Ingo Molnar 2008-05-11 15:29 ` Matthew Wilcox 0 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-11 15:19 UTC (permalink / raw) To: Matthew Wilcox Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra * Matthew Wilcox <matthew@wil.cx> wrote: > > exactly what usecase is that? Perhaps it could be converted to an > > atomic counter + the wait_event() APIs. > > Effectively, it's a completion. It just works better with staggered > wakeups than it does with the naive completion. So why not transform it to real completions instead? And if our current 'struct completion' abstraction is insufficient for whatever reason, why not extend that instead? Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 15:19 ` Ingo Molnar @ 2008-05-11 15:29 ` Matthew Wilcox 2008-05-13 14:11 ` Ingo Molnar 0 siblings, 1 reply; 140+ messages in thread From: Matthew Wilcox @ 2008-05-11 15:29 UTC (permalink / raw) To: Ingo Molnar Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra On Sun, May 11, 2008 at 05:19:09PM +0200, Ingo Molnar wrote: > * Matthew Wilcox <matthew@wil.cx> wrote: > > > exactly what usecase is that? Perhaps it could be converted to an > > > atomic counter + the wait_event() APIs. > > > > Effectively, it's a completion. It just works better with staggered > > wakeups than it does with the naive completion. > > So why not transform it to real completions instead? And if our current > 'struct completion' abstraction is insufficient for whatever reason, why > not extend that instead? My point is that for the only user of counting semaphores and/or semaphores-abused-as-completions that has so far hit this race, the serialised wake-up performs better. You have not pointed at a scenario that _shows_ a parallel wake-up to perform better. Some hand-waving and talking about lofty principles, yes. But no actual data. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 15:29 ` Matthew Wilcox @ 2008-05-13 14:11 ` Ingo Molnar 2008-05-13 14:21 ` Matthew Wilcox 0 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-13 14:11 UTC (permalink / raw) To: Matthew Wilcox Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra * Matthew Wilcox <matthew@wil.cx> wrote: > > So why not transform it to real completions instead? And if our > > current 'struct completion' abstraction is insufficient for whatever > > reason, why not extend that instead? > > My point is that for the only user of counting semaphores and/or > semaphores-abused-as-completions that has so far hit this race, the > serialised wake-up performs better. You have not pointed at a > scenario that _shows_ a parallel wake-up to perform better. [...] a 50% AIM7 slowdown maybe? With the BKL being a spinlock again it doesnt matter much in practice though. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-13 14:11 ` Ingo Molnar @ 2008-05-13 14:21 ` Matthew Wilcox 2008-05-13 14:42 ` Ingo Molnar 0 siblings, 1 reply; 140+ messages in thread From: Matthew Wilcox @ 2008-05-13 14:21 UTC (permalink / raw) To: Ingo Molnar Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra On Tue, May 13, 2008 at 04:11:29PM +0200, Ingo Molnar wrote: > > * Matthew Wilcox <matthew@wil.cx> wrote: > > > > So why not transform it to real completions instead? And if our > > > current 'struct completion' abstraction is insufficient for whatever > > > reason, why not extend that instead? > > > > My point is that for the only user of counting semaphores and/or > > semaphores-abused-as-completions that has so far hit this race, the > > serialised wake-up performs better. You have not pointed at a > > scenario that _shows_ a parallel wake-up to perform better. [...] > > a 50% AIM7 slowdown maybe? With the BKL being a spinlock again it doesnt > matter much in practice though. You're not understanding me. This is completely inapplicable to the BKL because only one task can be in wakeup at a time (due to it having a maximum value of 1). There's no way to hit this race with the BKL. The only kind of semaphore that can hit this race is the kind that can have more than one wakeup in progress at a time -- ie one which can have a value >1. Like completions and real counting semaphores. So the only thing worth talking about (and indeed, it's now entirely moot) is what's the best way to solve this problem /for this kind of semaphore/. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-13 14:21 ` Matthew Wilcox @ 2008-05-13 14:42 ` Ingo Molnar 2008-05-13 15:28 ` Matthew Wilcox 0 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-13 14:42 UTC (permalink / raw) To: Matthew Wilcox Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra * Matthew Wilcox <matthew@wil.cx> wrote: > > a 50% AIM7 slowdown maybe? With the BKL being a spinlock again it > > doesnt matter much in practice though. > > You're not understanding me. This is completely inapplicable to the > BKL because only one task can be in wakeup at a time (due to it having > a maximum value of 1). There's no way to hit this race with the BKL. > The only kind of semaphore that can hit this race is the kind that can > have more than one wakeup in progress at a time -- ie one which can > have a value >1. Like completions and real counting semaphores. yes, but even for parallel wakeups for completions it's good in general to keep more tasks in flight than to keep less tasks in flight. Perhaps the code could throttle them to nr_cpus, but otherwise, as the BKL example has shown it (in another context), we do much better if we overload the scheduler (in which case it can and does batch intelligently) than if we try to second-guess it and under-load it and create lots of scheduling events. i'd agree with you that there are no numbers available pro or contra, so you are right that my 50% point does not apply to your argument. > So the only thing worth talking about (and indeed, it's now entirely > moot) is what's the best way to solve this problem /for this kind of > semaphore/. it's not really moot in terms of improving the completions code i suspect? For XFS i guess performance matters. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes
  2008-05-13 14:42 ` Ingo Molnar
@ 2008-05-13 15:28 ` Matthew Wilcox
  2008-05-13 17:13   ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-13 15:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
      Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
      Peter Zijlstra

On Tue, May 13, 2008 at 04:42:07PM +0200, Ingo Molnar wrote:
> yes, but even for parallel wakeups for completions it's good in general
> to keep more tasks in flight than to keep less tasks in flight.

That might be the case for some users, but it isn't the case for XFS.
The first thing that each task does is grab a spinlock, so if you put as
much in flight as early as possible, you end up with horrible contention
on that spinlock. I have no idea whether this is the common case for
multi-valued semaphores or not, it's just the only one I have data for.

> > So the only thing worth talking about (and indeed, it's now entirely
> > moot) is what's the best way to solve this problem /for this kind of
> > semaphore/.
> 
> it's not really moot in terms of improving the completions code i
> suspect? For XFS i guess performance matters.

I think the completion code is less optimised than the semaphore code
today. Clearly the same question does arise, but I don't know what the
answer is for completion users either. Let's do a quick survey.

drivers/net has 5 users:

3c527.c -- execution_cmd has a mutex held, so never more than one task
	waiting anyway. xceiver_cmd is called during open and close which I
	think are serialised at a higher level. In any case, no performance
	issue here.
iseries_veth.c -- grabs a spinlock soon after being woken.
plip.c -- called in close, no perf implication.
ppp_synctty.c -- called in close, no perf implication.
ps3_gelic_wireless.c - If this isn't serialised, it's buggy.

Maybe drivers/net is a bad example. Let's look at */*.c:

as-iosched.c -- in exit path.
blk-barrier.c -- completion on stack, so only one waiter.
blk-exec.c -- ditto
cfq-iosched.c -- in exit path
crypto/api.c -- in init path
crypto/gcm.c -- in setkey path
crypto/tcrypt.c -- crypto testing. Not a perf path.
fs/exec.c -- waiting for coredumps.
kernel/exit.c -- likewise
kernel/fork.c -- completion on stack
kernel/kmod.c -- completion on stack
kernel/kthread.c -- kthread creation and deletion. Shouldn't be a hot
	path, plus this looks like there's only going to be one task waiting.
kernel/rcupdate.c -- one completion on stack, one synchronised by a mutex
kernel/rcutorture.c -- doesn't matter
kernel/sched.c -- both completions on stack
kernel/stop_machine.c -- completion on stack
kernel/sysctl.c -- completion on stack
kernel/workqueue.c -- completion on stack
lib/klist.c -- This one seems like it could potentially have lots of
	waiters, if only anything actually used klists.

It seems like most users use completions where it'd be just as easy to
use a task pointer and call wake_up_task(). In any case, I think there's
no evidence one way or the other about how people are using
multi-sleeper completions.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-13 15:28 ` Matthew Wilcox @ 2008-05-13 17:13 ` Ingo Molnar 2008-05-13 17:22 ` Linus Torvalds 0 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-13 17:13 UTC (permalink / raw) To: Matthew Wilcox Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra * Matthew Wilcox <matthew@wil.cx> wrote: > > yes, but even for parallel wakeups for completions it's good in > > general to keep more tasks in flight than to keep less tasks in > > flight. > > That might be the case for some users, but it isn't the case for XFS. > The first thing that each task does is grab a spinlock, so if you put > as much in flight as early as possible, you end up with horrible > contention on that spinlock. [...] hm, this sounds like damage that is inflicted on itself by the XFS code. Why does it signal to its waiters that "resource is available", when in reality that resource is not available but immediately serialized via a lock? (even if the lock might technically be some _other_ object) I have not looked closely at this but the more natural wakeup flow here would be that if you know there's going to be immediate contention, to signal a _single_ resource to a _single_ waiter, and then once that contention point is over a (hopefully) much more parallel processing phase occurs, to use a multi-value completion there. in other words: dont tell the scheduler that there is parallelism in the system when in reality there is not. And for the same reason, do not throttle wakeups in a completion mechanism artificially because one given user utilizes it suboptimally. Once throttled it's not possible to regain that lost parallelism. > [...] I have no idea whether this is the common case for multi-valued > semaphores or not, it's just the only one I have data for. yeah. I'd guess XFS would be the primary user in this area who cares about performance. 
> It seems like most users use completions where it'd be just as easy to > use a task pointer and call wake_up_task(). [...] yeah - although i guess in general it's a bit safer to use an explicit completion. With a task pointer you have to be sure the task is still present, etc. (with a completion you are forced to put that completion object _somewhere_, which immediately forces one to think about lifetime issues. A wakeup to a single task pointer is way too easy to get wrong.) So in general i'd recommend the use of completions. > [...] In any case, I think there's no evidence one way or the other > about how people are using multi-sleeper completions. yeah, that's definitely so. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-13 17:13 ` Ingo Molnar @ 2008-05-13 17:22 ` Linus Torvalds 2008-05-13 21:05 ` Ingo Molnar 0 siblings, 1 reply; 140+ messages in thread From: Linus Torvalds @ 2008-05-13 17:22 UTC (permalink / raw) To: Ingo Molnar Cc: Matthew Wilcox, Sven Wegener, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra On Tue, 13 May 2008, Ingo Molnar wrote: > > hm, this sounds like damage that is inflicted on itself by the XFS code. No. You're confusing what a counting semaphore is. > Why does it signal to its waiters that "resource is available", when in > reality that resource is not available but immediately serialized via a > lock? (even if the lock might technically be some _other_ object) So you have 'n' resources available, and you use a counting semaphore for that resource counting. But you'd still need a spinlock to actually protect the list of resources themselves. The spinlock protects a different thing than the semaphore. The semaphore isn't about mutual exclusion - it's about counting resources and waiting when there are too many things in flight. And you seem to think that a counting semaphore is about mutual exclusion. It has nothing what-so-ever to do with that. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-13 17:22 ` Linus Torvalds @ 2008-05-13 21:05 ` Ingo Molnar 0 siblings, 0 replies; 140+ messages in thread From: Ingo Molnar @ 2008-05-13 21:05 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Sven Wegener, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Why does it signal to its waiters that "resource is available", when > > in reality that resource is not available but immediately serialized > > via a lock? (even if the lock might technically be some _other_ > > object) > > So you have 'n' resources available, and you use a counting semaphore > for that resource counting. > > But you'd still need a spinlock to actually protect the list of > resources themselves. The spinlock protects a different thing than the > semaphore. The semaphore isn't about mutual exclusion - it's about > counting resources and waiting when there are too many things in > flight. > > And you seem to think that a counting semaphore is about mutual > exclusion. It has nothing what-so-ever to do with that. i was reacting to the point that Matthew made: " The first thing that each task does is grab a spinlock, so if you put as much in flight as early as possible, you end up with horrible contention on that spinlock. " We were talking about the case of double, parallel up()s. My point was that the best guess is to put two tasks in flight in the synchronization primitive. Matthew's point was that the wakeups of the two tasks should be chained: one task gets woken first, which then wakes the second task in a chain. [ Matthew, i'm paraphrasing your opinion so please correct me if i'm misinterpreting your point. 
] My argument is that chaining like that in the synchronization primitive is bad for parallelism in general, because wakeup latencies are just too long in general and any action required in the context of the "first waker" throttles parallelism artificially and introduces artificial workload delays. Even in the simple list+lock case you are talking about it's beneficial to keep as many wakers in flight as possible. The reason is that even with the worst-possible cacheline bouncing imaginable, it's much better to spin a bit on a spinlock (which is "horrible bouncing" only on paper, in practice it's nicely ordered waiting on a FIFO spinlock, with a list op every 100 nsecs or so) than to implement "soft bouncing" of tasks via an artificial chain of wakeups. That artificial chain of wakeups has a latency of a few microseconds even in the best case - and in the worst-case it can be a delay of milliseconds later - throttling not just the current parallelism of the workload but also hiding potential future parallelism. The hardware is so much faster at messaging. [*] [**] Ingo [*] Not without qualifications - maybe not so on 1024 CPU systems, but certainly so on anything sane up to 16way or so. But if worry about 1024 CPU systems i'd suggest to first take a look at the current practice in kernel/semaphore.c taking the semaphore internal spinlock again _after_ a task has woken up, just to remove itself from the list of waiters. That is rather unnecessary serialization - at up() time we are already holding the lock, so we should remove the entry there. That's what the mutex code does too. [**] The only place where i can see staggered/chained wakeups help is in the specific case when the wakee runs into a heavy lock _right after_ wakeup. XFS might be running into that and my reply above talks about that hypothetical scenario. If it is so then it is handled incorrectly, because in that case we dont have any true parallelism at the point of up(), and we know it right there. 
There is no reason to pretend at that point that we have more parallelism, when all we do is we block on a heavy lock right after wakeup. Instead, the correct implementation in that case would be to have a wait queue for that heavy lock (which in other words can be thought of as a virtual single-resource which is either 'available' or 'unavailable'), and _then_ after that to use a counting semaphore for the rest of the program flow, which is hopefully more parallel. I.e. precisely map the true parallelism of the code via the synchronization primitives, do not over-synchronize it and do not under-synchronize it. And if there's any doubt which one should be used, under-synchronize it - because while the scheduler is rather good at dealing with too many tasks it cannot conjure runnable tasks out of thin air. Btw., the AIM7 regression was exactly that: in the 50% regressed workload the semaphore code hid the true parallelism of the workload and we had only had 5 tasks on the runqueue and the scheduler had no chance to saturate all CPUs. In the "good" case (be that spinlock based or proper-semaphores based BKL) there were 2000 runnable tasks on the runqueues, and the scheduler sorted them out and batched the workload just fine! ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes
  2008-05-11 13:02 ` Matthew Wilcox
  2008-05-11 13:26   ` Matthew Wilcox
@ 2008-05-11 13:54 ` Ingo Molnar
  2008-05-11 14:22   ` Matthew Wilcox
  1 sibling, 1 reply; 140+ messages in thread
From: Ingo Molnar @ 2008-05-11 13:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
      Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Matthew Wilcox <matthew@wil.cx> wrote:

> > +	/*
> > +	 * Rotate sleepers - to make sure all of them get woken in case
> > +	 * of parallel up()s:
> > +	 */
> > +	list_move_tail(&waiter->list, &sem->wait_list);
> 
> Seems like extra cache line dirtying for no real gain over my
> solution.

the gain is rather obvious: two parallel up()s (or just up()s which come
close enough after each other) will wake up two tasks in parallel. With
your patch, the first guy wakes up and then it wakes up the second guy.
I.e. your patch serializes the wakeup chain, mine keeps it parallel.

the cache line dirtying is rather secondary to any solution - the first
goal for any locking primitive is to get scheduling precise: to not wake
up more tasks than optimal and to not wake up less tasks than optimal.

i.e. can you see any conceptual hole in the patch below?

	Ingo

---
 kernel/semaphore.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -258,5 +258,11 @@ static noinline void __sched __up(struct
 {
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 					struct semaphore_waiter, list);
+	/*
+	 * Rotate sleepers - to make sure all of them get woken in case
+	 * of parallel up()s:
+	 */
+	list_move_tail(&waiter->list, &sem->wait_list);
+
 	wake_up_process(waiter->task);
 }

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes
  2008-05-11 13:54 ` Ingo Molnar
@ 2008-05-11 14:22 ` Matthew Wilcox
  2008-05-11 14:32   ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-11 14:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML,
      Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Sun, May 11, 2008 at 03:54:14PM +0200, Ingo Molnar wrote:
> the gain is rather obvious: two parallel up()s (or just up()s which come
> close enough after each other) will wake up two tasks in parallel. With
> your patch, the first guy wakes up and then it wakes up the second guy.
> I.e. your patch serializes the wakeup chain, mine keeps it parallel.

Yup. I explained why that's actually beneficial in an earlier email.

> the cache line dirtying is rather secondary to any solution - the first
> goal for any locking primitive is to get scheduling precise: to not wake
> up more tasks than optimal and to not wake up less tasks than optimal.

That's a laudable goal, but ultimately it's secondary to performance (or
this thread wouldn't exist).

> i.e. can you see any conceptual hole in the patch below?

No conceptual holes, just a performance one. Either we want your patch
below or mine; definitely not both.
diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5e41217..e520ad4 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -229,6 +229,11 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
 	}

 	list_del(&waiter.list);
+
+	/* It's possible we need to wake up the next task on the list too */
+	if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list))
+		__up(sem);
+
 	return ret;
 }

> 	Ingo
> 
> ---
>  kernel/semaphore.c |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> Index: linux/kernel/semaphore.c
> ===================================================================
> --- linux.orig/kernel/semaphore.c
> +++ linux/kernel/semaphore.c
> @@ -258,5 +258,11 @@ static noinline void __sched __up(struct
>  {
>  	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
>  					struct semaphore_waiter, list);
> +	/*
> +	 * Rotate sleepers - to make sure all of them get woken in case
> +	 * of parallel up()s:
> +	 */
> +	list_move_tail(&waiter->list, &sem->wait_list);
> +
> 	wake_up_process(waiter->task);
> }

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

^ permalink raw reply related	[flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 14:22 ` Matthew Wilcox @ 2008-05-11 14:32 ` Ingo Molnar 2008-05-11 14:46 ` Matthew Wilcox 2008-05-11 16:47 ` Linus Torvalds 0 siblings, 2 replies; 140+ messages in thread From: Ingo Molnar @ 2008-05-11 14:32 UTC (permalink / raw) To: Matthew Wilcox Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin * Matthew Wilcox <matthew@wil.cx> wrote: > > the gain is rather obvious: two parallel up()s (or just up()s which > > come close enough after each other) will wake up two tasks in > > parallel. With your patch, the first guy wakes up and then it wakes > > up the second guy. I.e. your patch serializes the wakeup chain, mine > > keeps it parallel. > > Yup. I explained why that's actually beneficial in an earlier email. but the problem is that by serializing the wakeup chains naively you introduced a more than 50% AIM7 performance regression. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 14:32 ` Ingo Molnar @ 2008-05-11 14:46 ` Matthew Wilcox 2008-05-11 16:47 ` Linus Torvalds 1 sibling, 0 replies; 140+ messages in thread From: Matthew Wilcox @ 2008-05-11 14:46 UTC (permalink / raw) To: Ingo Molnar Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin On Sun, May 11, 2008 at 04:32:27PM +0200, Ingo Molnar wrote: > > * Matthew Wilcox <matthew@wil.cx> wrote: > > > > the gain is rather obvious: two parallel up()s (or just up()s which > > > come close enough after each other) will wake up two tasks in > > > parallel. With your patch, the first guy wakes up and then it wakes > > > up the second guy. I.e. your patch serializes the wakeup chain, mine > > > keeps it parallel. > > > > Yup. I explained why that's actually beneficial in an earlier email. > > but the problem is that by serializing the wakeup chains naively you > introduced a more than 50% AIM7 performance regression. That's a different issue. The AIM7 regression is to do with whether a task that is currently running and hits a semaphore that has no current holder but someone else waiting should be allowed to jump the queue. No argument there; performance trumps theoretical fairness. This issue is whether multiple sleepers should be woken up all-at-once or one-at-a-time. Here, you seem to be arguing for theoretical fairness to trump performance. (Let's be quite clear; this issue affects *only* multiple sleepers and multiple wakes given to those sleepers. ie semaphores-being-used-as-completions and true counting semaphores). -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 14:32 ` Ingo Molnar 2008-05-11 14:46 ` Matthew Wilcox @ 2008-05-11 16:47 ` Linus Torvalds 1 sibling, 0 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-11 16:47 UTC (permalink / raw) To: Ingo Molnar Cc: Matthew Wilcox, Sven Wegener, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin On Sun, 11 May 2008, Ingo Molnar wrote: > > but the problem is that by serializing the wakeup chains naively you > introduced a more than 50% AIM7 performance regression. No, the problem is that the BKL shouldn't be a semaphore at all. Performance fixed. After that, it's *purely* a correctness issue, and then we might as well be fair and not allow the stealing of semaphores from under waiters at ALL. Which is what Matthews original code did. In other words, current -git is fine, and was already confirmed by Sven to fix the bug (before *any* of your patches were), and was earlier confirmed to fix the AIM7 performance regression (better than *any* of your patches were). So I fixed the problems last night already. Stop wasting everybody's time. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 11:03 ` Matthew Wilcox 2008-05-11 11:14 ` Matthew Wilcox 2008-05-11 11:48 ` Matthew Wilcox @ 2008-05-11 13:01 ` Ingo Molnar 2008-05-11 13:06 ` Matthew Wilcox 2008-05-11 14:10 ` Sven Wegener 3 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-11 13:01 UTC (permalink / raw) To: Matthew Wilcox Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin * Matthew Wilcox <matthew@wil.cx> wrote: > + /* It's possible we need to wake up the next task on the list too */ > + if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list)) > + __up(sem); this needs to check for ret != 0 as well, otherwise we can be woken but a timeout can also trigger => we lose a wakeup. I.e. like the patch below. Hm? Ingo -----------------------------> Subject: semaphore: fix #3 From: Ingo Molnar <mingo@elte.hu> Date: Sun May 11 09:51:07 CEST 2008 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/semaphore.c | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-) Index: linux/kernel/semaphore.c =================================================================== --- linux.orig/kernel/semaphore.c +++ linux/kernel/semaphore.c @@ -194,6 +194,13 @@ struct semaphore_waiter { struct task_struct *task; }; +static noinline void __sched __up(struct semaphore *sem) +{ + struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list, + struct semaphore_waiter, list); + wake_up_process(waiter->task); +} + /* * Because this function is inlined, the 'state' parameter will be * constant, and thus optimised away by the compiler. 
Likewise the @@ -231,6 +238,10 @@ static inline int __sched __down_common( } list_del(&waiter.list); + if (unlikely(!list_empty(&sem->wait_list)) && + ((sem->count > 1) || ret)) + __up(sem); + return ret; } @@ -254,9 +265,10 @@ static noinline int __sched __down_timeo return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies); } -static noinline void __sched __up(struct semaphore *sem) -{ - struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list, - struct semaphore_waiter, list); - wake_up_process(waiter->task); -} + + /* + * Rotate sleepers - to make sure all of them get woken in case + * of parallel up()s: + */ + list_move_tail(&waiter->list, &sem->wait_list); + ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 13:01 ` Ingo Molnar @ 2008-05-11 13:06 ` Matthew Wilcox 2008-05-11 13:45 ` Ingo Molnar 0 siblings, 1 reply; 140+ messages in thread From: Matthew Wilcox @ 2008-05-11 13:06 UTC (permalink / raw) To: Ingo Molnar Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin On Sun, May 11, 2008 at 03:01:27PM +0200, Ingo Molnar wrote: > * Matthew Wilcox <matthew@wil.cx> wrote: > > > + /* It's possible we need to wake up the next task on the list too */ > > + if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list)) > > + __up(sem); > > this needs to check for ret != 0 as well, otherwise we can be woken but > a timeout can also trigger => we lose a wakeup. I.e. like the patch > below. Hm? Still mangled ... and I don't see how we lose a wakeup. We test for having the semaphore before we check for having been interrupted, and we hold the lock the whole time. IOW, what I think you're checking for is: task A task B if sem->count >0 break; sem->count++ wake_up_process(B) if (state == TASK_INTERRUPTIBLE && signal_pending(task)) break; which can't happen because of sem->lock. 
> Ingo > > -----------------------------> > Subject: semaphore: fix #3 > From: Ingo Molnar <mingo@elte.hu> > Date: Sun May 11 09:51:07 CEST 2008 > > Signed-off-by: Ingo Molnar <mingo@elte.hu> > --- > kernel/semaphore.c | 24 ++++++++++++++++++------ > 1 file changed, 18 insertions(+), 6 deletions(-) > > Index: linux/kernel/semaphore.c > =================================================================== > --- linux.orig/kernel/semaphore.c > +++ linux/kernel/semaphore.c > @@ -194,6 +194,13 @@ struct semaphore_waiter { > struct task_struct *task; > }; > > +static noinline void __sched __up(struct semaphore *sem) > +{ > + struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list, > + struct semaphore_waiter, list); > + wake_up_process(waiter->task); > +} > + > /* > * Because this function is inlined, the 'state' parameter will be > * constant, and thus optimised away by the compiler. Likewise the > @@ -231,6 +238,10 @@ static inline int __sched __down_common( > } > > list_del(&waiter.list); > + if (unlikely(!list_empty(&sem->wait_list)) && > + ((sem->count > 1) || ret)) > + __up(sem); > + > return ret; > } > > @@ -254,9 +265,10 @@ static noinline int __sched __down_timeo > return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies); > } > > -static noinline void __sched __up(struct semaphore *sem) > -{ > - struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list, > - struct semaphore_waiter, list); > - wake_up_process(waiter->task); > -} > + > + /* > + * Rotate sleepers - to make sure all of them get woken in case > + * of parallel up()s: > + */ > + list_move_tail(&waiter->list, &sem->wait_list); > + -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 13:06 ` Matthew Wilcox @ 2008-05-11 13:45 ` Ingo Molnar 0 siblings, 0 replies; 140+ messages in thread From: Ingo Molnar @ 2008-05-11 13:45 UTC (permalink / raw) To: Matthew Wilcox Cc: Sven Wegener, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin * Matthew Wilcox <matthew@wil.cx> wrote: > IOW, what I think you're checking for is: > > task A task B > if sem->count >0 > break; > sem->count++ > wake_up_process(B) > if (state == TASK_INTERRUPTIBLE && signal_pending(task)) > break; > > which can't happen because of sem->lock. ok, agreed, that race cannot happen. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [git pull] scheduler fixes 2008-05-11 11:03 ` Matthew Wilcox ` (2 preceding siblings ...) 2008-05-11 13:01 ` Ingo Molnar @ 2008-05-11 14:10 ` Sven Wegener 3 siblings, 0 replies; 140+ messages in thread From: Sven Wegener @ 2008-05-11 14:10 UTC (permalink / raw) To: Matthew Wilcox Cc: Ingo Molnar, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin On Sun, 11 May 2008, Matthew Wilcox wrote: > On Thu, May 08, 2008 at 05:10:28PM +0200, Ingo Molnar wrote: >> @@ -258,7 +256,5 @@ static noinline void __sched __up(struct semaphore *sem) >> { >> struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list, >> struct semaphore_waiter, list); >> - list_del(&waiter->list); >> - waiter->up = 1; >> wake_up_process(waiter->task); >> } > > This might be the problem that causes the missing wakeups. If you have a > semaphore with n=2, and four processes calling down(), tasks A and B > acquire the semaphore and tasks C and D go to sleep. Task A calls up() > and wakes up C. Then task B calls up() and doesn't wake up anyone > because C hasn't run yet. I think we need another wakeup when task C > finishes in __down_common, like this (on top of your patch): > > diff --git a/kernel/semaphore.c b/kernel/semaphore.c > index 5e41217..e520ad4 100644 > --- a/kernel/semaphore.c > +++ b/kernel/semaphore.c > @@ -229,6 +229,11 @@ static inline int __sched __down_common(struct semaphore *sem, long state, > } > > list_del(&waiter.list); > + > + /* It's possible we need to wake up the next task on the list too */ > + if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list)) > + __up(sem); > + > return ret; > } > > Sven, can you try this with your workload? I suspect this might be it > because XFS does use semaphores with n>1. This one fixes the regression too, after applying it on top of bf726e. Sven ^ permalink raw reply [flat|nested] 140+ messages in thread
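The n=2 scenario Matthew describes can be replayed deterministically in a toy model (invented names, single-threaded, no locking — a sketch of the idea, not the kernel implementation): two up()s land before the first woken task gets to run, so the second wakeup hits a task that is already awake; the forward-wake on exit from __down_common passes the surplus count along to the next sleeper.

```c
#include <assert.h>

#define NT 2

struct model {
	int count;
	int nwait;
	int waiters[NT];	/* task ids still on the wait list, FIFO */
	int woken[8];		/* wakeup pending, per task id */
	int ran[8];		/* task acquired the semaphore */
};

/* up(): bump the count and wake the head of the list; the waiter
 * removes itself later, so a second up() targets the same task. */
static void model_up(struct model *m)
{
	m->count++;
	if (m->nwait)
		m->woken[m->waiters[0]] = 1;
}

/* A woken task finishing __down_common: consume one count, leave the
 * list, and (the fix) forward the wakeup if counts remain for others. */
static void model_finish_down(struct model *m, int task)
{
	int i;

	assert(m->woken[task] && m->count > 0);
	m->count--;
	m->ran[task] = 1;
	for (i = 1; i < m->nwait; i++)
		m->waiters[i - 1] = m->waiters[i];
	m->nwait--;
	if (m->count > 0 && m->nwait)	/* wake the next task too */
		m->woken[m->waiters[0]] = 1;
}
```

Without the final forward-wake, the second sleeper would never see its wakeup even though a count is available — the lost wakeup Sven hit.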
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) 2008-05-08 12:28 ` Ingo Molnar 2008-05-08 14:43 ` Ingo Molnar @ 2008-05-08 16:02 ` Linus Torvalds 2008-05-08 18:30 ` Linus Torvalds 1 sibling, 1 reply; 140+ messages in thread From: Linus Torvalds @ 2008-05-08 16:02 UTC (permalink / raw) To: Ingo Molnar Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin On Thu, 8 May 2008, Ingo Molnar wrote: > > Peter pointed it out that because sem->count is u32, the <= 0 is in fact > a "== 0" condition - the patch below does that. As expected gcc figured > out the same thing too so the resulting code output did not change. (so > this is just a cleanup) Why don't we just make it do the same thing that the x86 semaphores used to do: make it signed, and decrement unconditionally. And call the slow-path if it became negative. IOW, make the fast-path be spin_lock_irqsave(&sem->lock, flags); if (--sem->count < 0) __down(); spin_unlock_irqrestore(&sem->lock, flags); and now we have an existing known-good implementation to look at? Rather than making up a totally new and untested thing. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
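Linus's proposed fast path can be modeled in a few lines of plain C (locking and the real sleep/wake machinery are stubbed out with counters; the names are illustrative): with a signed count, a negative value encodes the number of sleepers, so up() needs the slow path exactly when the incremented count is still <= 0.

```c
#include <assert.h>

/* Signed-count model: count < 0 means -count tasks are sleeping. */
struct toy_sem {
	int count;	/* protected by sem->lock in the real code */
	int sleepers;	/* stand-in for the wait list */
	int wakeups;	/* stand-in for wake_up_process() calls */
};

static void model_down(struct toy_sem *s)
{
	if (--s->count < 0)
		s->sleepers++;		/* __down(): go to sleep */
}

static void model_up(struct toy_sem *s)
{
	if (++s->count <= 0) {		/* someone was sleeping */
		s->sleepers--;		/* __up(): wake one waiter */
		s->wakeups++;
	}
}
```

The appeal is that the fast path touches only the counter; the wait list is consulted only when the sign says a sleeper must exist.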
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) 2008-05-08 16:02 ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Linus Torvalds @ 2008-05-08 18:30 ` Linus Torvalds 2008-05-08 20:19 ` Ingo Molnar 0 siblings, 1 reply; 140+ messages in thread From: Linus Torvalds @ 2008-05-08 18:30 UTC (permalink / raw) To: Ingo Molnar Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin On Thu, 8 May 2008, Linus Torvalds wrote: > > Why don't we just make it do the same thing that the x86 semaphores used > to do: make it signed, and decrement unconditionally. And call the > slow-path if it became negative. > ... > and now we have an existing known-good implementation to look at? Ok, after having thought that over, and looked at the code, I think I like your version after all. The old implementation was pretty complex due to the need to be so extra careful about the count that could change outside of the lock, so everything considered, a new implementation that is simpler is probably the better choice. Ergo, I will just pull your scheduler tree. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) 2008-05-08 18:30 ` Linus Torvalds @ 2008-05-08 20:19 ` Ingo Molnar 2008-05-08 20:27 ` Linus Torvalds 0 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-08 20:19 UTC (permalink / raw) To: Linus Torvalds Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, 8 May 2008, Linus Torvalds wrote: > > > > Why don't we just make it do the same thing that the x86 semaphores used > > to do: make it signed, and decrement unconditionally. And call the > > slow-path if it became negative. > > ... > > and now we have an existing known-good implementation to look at? > > Ok, after having thought that over, and looked at the code, I think I > like your version after all. The old implementation was pretty complex > due to the need to be so extra careful about the count that could > change outside of the lock, so everything considered, a new > implementation that is simpler is probably the better choice. yeah. I thought about that too, the problem i found is this thing in the old lib/semaphore-sleepers.c code's __down() path: remove_wait_queue_locked(&sem->wait, &wait); wake_up_locked(&sem->wait); spin_unlock_irqrestore(&sem->wait.lock, flags); tsk->state = TASK_RUNNING; that mystery wakeup once i understood to be necessary for some weird ordering reason, but it would probably be hard to justify in the new code, because it's done unconditionally, regardless of whether there are sleepers around. And once we deviate from the old code, we might as well go for the simplest approach - which also happens to be rather close to the mutex code's current slowpath - just with counting property added, legacy semantics and no lockdep coverage. > Ergo, I will just pull your scheduler tree. great! 
Meanwhile a 100 randconfigs booted fine with that tree so i'd say the implementation is robust. i also did a quick re-test of AIM7 because the wakeup logic changed a bit from what i tested initially (from round-robin to strict FIFO), and as expected not much changed in the AIM7 results on the quad: Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 55019.9 96 211.6 806.5 0.4585 2000 55116.2 90 211.2 804.7 0.4593 2000 55082.3 82 211.3 805.5 0.4590 this is slightly lower but the test was not fully apples to apples because this also had some tracing active and other small details. It's still very close to the v2.6.25 numbers. I suspect some more performance could be won in this particular workload by getting rid of the BKL dependency altogether. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) 2008-05-08 20:19 ` Ingo Molnar @ 2008-05-08 20:27 ` Linus Torvalds 2008-05-08 21:45 ` Ingo Molnar 0 siblings, 1 reply; 140+ messages in thread From: Linus Torvalds @ 2008-05-08 20:27 UTC (permalink / raw) To: Ingo Molnar Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin On Thu, 8 May 2008, Ingo Molnar wrote: > > I suspect some more performance could be won in this particular workload > by getting rid of the BKL dependency altogether. Did somebody have any trace on which BKL taker it is? It apparently wasn't file locking. Was it the tty layer? Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) 2008-05-08 20:27 ` Linus Torvalds @ 2008-05-08 21:45 ` Ingo Molnar 2008-05-08 22:02 ` Ingo Molnar 2008-05-08 22:55 ` Linus Torvalds 0 siblings, 2 replies; 140+ messages in thread From: Ingo Molnar @ 2008-05-08 21:45 UTC (permalink / raw) To: Linus Torvalds Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I suspect some more performance could be won in this particular > > workload by getting rid of the BKL dependency altogether. > > Did somebody have any trace on which BKL taker it is? It apparently > wasn't file locking. Was it the tty layer? yeah, i captured a trace of all the down()s that happen in the workload, using ftrace/sched_switch + stacktrace + a tracepoint in down(). Here's the semaphore activities in the trace: # grep lock_kernel /debug/tracing/trace | cut -d: -f2- | sort | \ uniq -c | sort -n | cut -d= -f1-4 9 down <= lock_kernel <= proc_lookup_de <= proc_lookup < 12 down <= lock_kernel <= de_put <= proc_delete_inode < 14 down <= lock_kernel <= proc_lookup_de <= proc_lookup < 19 down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl < 58 down <= lock_kernel <= tty_release <= __fput < 62 down <= lock_kernel <= chrdev_open <= __dentry_open < 70 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs < 2512 down <= lock_kernel <= opost <= write_chan < 2574 down <= lock_kernel <= write_chan <= tty_write < note that this is running the fixed semaphore code, so contended semaphores are really rare in the trace. The histogram above includes all calls to down(). here's the full trace file (from a single CPU, on a dual-core box running aim7): http://redhat.com/~mingo/misc/aim7-ftrace.txt some other interesting stats. 
Top 5 wakeups sources: # grep wake_up aim7-ftrace.txt | cut -d: -f2- | sort | uniq -c | sort -n | cut -d= -f1-6 | tail -5 [...] 340 default_wake_function <= __wake_up_common <= __wake_up_sync <= unix_write_space <= sock_wfree <= skb_release_all < 411 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up_sync <= pipe_read <= do_sync_read < 924 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up_sync <= pipe_write <= do_sync_write < 1301 default_wake_function <= __wake_up_common <= __wake_up <= n_tty_receive_buf <= pty_write <= write_chan < 2065 wake_up_state <= prepare_signal <= send_signal <= __group_send_sig_info <= group_send_sig_info <= __kill_pgrp_info < top 10 scheduling points: # grep -w schedule aim7-ftrace.txt | cut -d: -f2- | sort | uniq -c | sort -n | cut -d= -f1-4 | tail -10 [...] 465 schedule <= __cond_resched <= _cond_resched <= shrink_active_list < 582 schedule <= cpu_idle <= start_secondary <= < 929 schedule <= pipe_wait <= pipe_read <= do_sync_read < 990 schedule <= __cond_resched <= _cond_resched <= shrink_page_list < 1034 schedule <= io_schedule <= get_request_wait <= __make_request < 1140 schedule <= worker_thread <= kthread <= child_rip < 1512 schedule <= retint_careful <= <= 0 < 1571 schedule <= __cond_resched <= _cond_resched <= shrink_active_list < 2034 schedule <= schedule_timeout <= do_select <= core_sys_select < 2355 schedule <= sysret_careful <= <= 0 < as visible from the trace, this is a CONFIG_PREEMPT_NONE kernel, so most of the preemptions get triggered by wakeups and get executed from the return-from-syscall need_resched check. But there's a fair amount of select() related sleeping as well, and a fair amount of IRQ-hits-userspace driven preemptions as well. a rather hectic workload, with a surprisingly large amount of TTY related activity. 
and here's a few seconds worth of NMI driven readprofile output: 1131 page_fault 70.6875 1181 find_lock_page 8.2014 1302 __isolate_lru_page 12.8911 1344 copy_page_c 84.0000 1588 page_lock_anon_vma 34.5217 1616 ext3fs_dirhash 3.4753 1683 ext3_htree_store_dirent 6.8975 1976 str2hashbuf 12.7484 1992 copy_user_generic_string 31.1250 2362 do_unlinkat 6.5069 2969 try_to_unmap 2.0791 3031 will_become_orphaned_pgrp 24.4435 4009 __copy_user_nocache 12.3735 4627 congestion_wait 34.0221 6624 clear_page_c 414.0000 18711 try_to_unmap_one 30.8254 34447 page_referenced 140.0285 38669 do_filp_open 17.7789 166569 page_referenced_one 886.0053 213361 *unknown* 216021 sync_page 3375.3281 391888 page_check_address 1414.7581 962212 total 0.3039 system overhead is consistently 20% during this test. the page_check_address() overhead is surprising - tons of rmap contention? about 10% wall-clock overhead in that function alone - and this is just on a dual-core box! Below is the instruction level profile of that function. The second column is the # of profile hits. The spin_lock() overhead is clearly visible. 
Ingo ___----- # of profile hits v ffffffff80286244: 3478 <page_check_address>: ffffffff80286244: 3478 55 push %rbp ffffffff80286245: 1092 48 89 d0 mov %rdx,%rax ffffffff80286248: 0 48 c1 e8 27 shr $0x27,%rax ffffffff8028624c: 0 48 89 e5 mov %rsp,%rbp ffffffff8028624f: 717 41 57 push %r15 ffffffff80286251: 1 25 ff 01 00 00 and $0x1ff,%eax ffffffff80286256: 0 49 89 cf mov %rcx,%r15 ffffffff80286259: 1354 41 56 push %r14 ffffffff8028625b: 0 49 89 fe mov %rdi,%r14 ffffffff8028625e: 0 48 89 d7 mov %rdx,%rdi ffffffff80286261: 1507 41 55 push %r13 ffffffff80286263: 0 41 54 push %r12 ffffffff80286265: 0 53 push %rbx ffffffff80286266: 1763 48 83 ec 08 sub $0x8,%rsp ffffffff8028626a: 0 48 8b 56 48 mov 0x48(%rsi),%rdx ffffffff8028626e: 4 48 8b 14 c2 mov (%rdx,%rax,8),%rdx ffffffff80286272: 3174 f6 c2 01 test $0x1,%dl ffffffff80286275: 0 0f 84 cc 00 00 00 je ffffffff80286347 <page_check_address+0x103> ffffffff8028627b: 0 48 89 f8 mov %rdi,%rax ffffffff8028627e: 64468 48 be 00 f0 ff ff ff mov $0x3ffffffff000,%rsi ffffffff80286285: 0 3f 00 00 ffffffff80286288: 1 48 b9 00 00 00 00 00 mov $0xffff810000000000,%rcx ffffffff8028628f: 0 81 ff ff ffffffff80286292: 4686 48 c1 e8 1b shr $0x1b,%rax ffffffff80286296: 0 48 21 f2 and %rsi,%rdx ffffffff80286299: 0 25 f8 0f 00 00 and $0xff8,%eax ffffffff8028629e: 7468 48 01 d0 add %rdx,%rax ffffffff802862a1: 0 48 8b 14 08 mov (%rax,%rcx,1),%rdx ffffffff802862a5: 11 f6 c2 01 test $0x1,%dl ffffffff802862a8: 4409 0f 84 99 00 00 00 je ffffffff80286347 <page_check_address+0x103> ffffffff802862ae: 0 48 89 f8 mov %rdi,%rax ffffffff802862b1: 0 48 21 f2 and %rsi,%rdx ffffffff802862b4: 1467 48 c1 e8 12 shr $0x12,%rax ffffffff802862b8: 0 25 f8 0f 00 00 and $0xff8,%eax ffffffff802862bd: 0 48 01 d0 add %rdx,%rax ffffffff802862c0: 944 48 8b 14 08 mov (%rax,%rcx,1),%rdx ffffffff802862c4: 17 f6 c2 01 test $0x1,%dl ffffffff802862c7: 1 74 7e je ffffffff80286347 <page_check_address+0x103> ffffffff802862c9: 927 48 89 d0 mov %rdx,%rax ffffffff802862cc: 77 48 c1 ef 
09 shr $0x9,%rdi ffffffff802862d0: 3 48 21 f0 and %rsi,%rax ffffffff802862d3: 1238 81 e7 f8 0f 00 00 and $0xff8,%edi ffffffff802862d9: 1 48 01 c8 add %rcx,%rax ffffffff802862dc: 0 4c 8d 24 38 lea (%rax,%rdi,1),%r12 ffffffff802862e0: 792 41 f6 04 24 81 testb $0x81,(%r12) ffffffff802862e5: 25 74 60 je ffffffff80286347 <page_check_address+0x103> ffffffff802862e7: 118074 48 c1 ea 0c shr $0xc,%rdx ffffffff802862eb: 41187 48 b8 00 00 00 00 00 mov $0xffffe20000000000,%rax ffffffff802862f2: 0 e2 ff ff ffffffff802862f5: 182 48 6b d2 38 imul $0x38,%rdx,%rdx ffffffff802862f9: 25998 48 8d 1c 02 lea (%rdx,%rax,1),%rbx ffffffff802862fd: 0 4c 8d 6b 10 lea 0x10(%rbx),%r13 ffffffff80286301: 0 4c 89 ef mov %r13,%rdi ffffffff80286304: 80598 e8 4b 17 28 00 callq ffffffff80507a54 <_spin_lock> ffffffff80286309: 36022 49 8b 0c 24 mov (%r12),%rcx ffffffff8028630d: 1623 f6 c1 81 test $0x81,%cl ffffffff80286310: 5 74 32 je ffffffff80286344 <page_check_address+0x100> ffffffff80286312: 16 48 b8 00 00 00 00 00 mov $0x1e0000000000,%rax ffffffff80286319: 0 1e 00 00 ffffffff8028631c: 359 48 ba b7 6d db b6 6d mov $0x6db6db6db6db6db7,%rdx ffffffff80286323: 0 db b6 6d ffffffff80286326: 12 48 c1 e1 12 shl $0x12,%rcx ffffffff8028632a: 0 49 8d 04 06 lea (%r14,%rax,1),%rax ffffffff8028632e: 492 48 c1 e9 1e shr $0x1e,%rcx ffffffff80286332: 23 48 c1 f8 03 sar $0x3,%rax ffffffff80286336: 0 48 0f af c2 imul %rdx,%rax ffffffff8028633a: 1390 48 39 c8 cmp %rcx,%rax ffffffff8028633d: 0 75 05 jne ffffffff80286344 <page_check_address+0x100> ffffffff8028633f: 0 4d 89 2f mov %r13,(%r15) ffffffff80286342: 165 eb 06 jmp ffffffff8028634a <page_check_address+0x106> ffffffff80286344: 0 fe 43 10 incb 0x10(%rbx) ffffffff80286347: 11886 45 31 e4 xor %r12d,%r12d ffffffff8028634a: 17451 5a pop %rdx ffffffff8028634b: 14507 5b pop %rbx ffffffff8028634c: 42 4c 89 e0 mov %r12,%rax ffffffff8028634f: 1736 41 5c pop %r12 ffffffff80286351: 40 41 5d pop %r13 ffffffff80286353: 1727 41 5e pop %r14 ffffffff80286355: 1420 41 5f pop %r15 
ffffffff80286357: 44 c9 leaveq ffffffff80286358: 1685 c3 retq gcc-4.2.3, the config is at: http://redhat.com/~mingo/misc/config-Thu_May__8_22_23_21_CEST_2008 ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) 2008-05-08 21:45 ` Ingo Molnar @ 2008-05-08 22:02 ` Ingo Molnar 2008-05-08 22:55 ` Linus Torvalds 1 sibling, 0 replies; 140+ messages in thread From: Ingo Molnar @ 2008-05-08 22:02 UTC (permalink / raw) To: Linus Torvalds Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin * Ingo Molnar <mingo@elte.hu> wrote: > yeah, i captured a trace of all the down()s that happen in the > workload, using ftrace/sched_switch + stacktrace + a tracepoint in > down(). Here's the semaphore activities in the trace: i've updated the trace with a better one: > http://redhat.com/~mingo/misc/aim7-ftrace.txt the first one had some idle time in it as well plus the effects of a 2000-task Ctrl-Z - which skewed the histograms. the new stats are more straightforward. down() callsite histogram: 42 down <= lock_kernel <= de_put <= proc_delete_inode < 42 down <= lock_kernel <= proc_lookup_de <= proc_lookup < 78 down <= lock_kernel <= proc_lookup_de <= proc_lookup < 310 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs < 332 down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl < 380 down <= lock_kernel <= tty_release <= __fput < 422 down <= lock_kernel <= chrdev_open <= __dentry_open < hm, why is chrdev_open() called all that often? top-5 wakeups: 4 default_wake_function <= __wake_up_common <= complete <= migration_thread <= kthread <= child_rip < 4 wake_up_process <= sched_exec <= do_execve <= sys_execve <= stub_execve <= < 8 wake_up_process <= __mutex_unlock_slowpath <= mutex_unlock <= do_filp_open <= do_sys_open <= sys_open < 40 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up <= sock_def_wakeup <= tcp_rcv_state_process < 98 default_wake_function <= __wake_up_common <= __wake_up_sync <= do_notify_parent <= do_exit <= do_group_exit < i.e. very little wakeup activities. 
Top 10 scheduling points: 10 schedule <= kjournald <= kthread <= child_rip < 12 schedule <= __down_write_nested <= __down_write <= down_write < 29 schedule <= worker_thread <= kthread <= child_rip < 40 schedule <= schedule_timeout <= inet_stream_connect <= sys_connect < 59 schedule <= __cond_resched <= _cond_resched <= generic_file_buffered_write < 111 schedule <= ksoftirqd <= kthread <= child_rip < 119 schedule <= do_wait <= sys_wait4 <= system_call_after_swapgs < 659 schedule <= do_exit <= do_group_exit <= sys_exit_group < 781 schedule <= sysret_careful <= <= 0 < 1347 schedule <= retint_careful <= <= 0 < > and here's a few seconds worth of NMI driven readprofile output: the NMI profiling results were accurate. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) 2008-05-08 21:45 ` Ingo Molnar 2008-05-08 22:02 ` Ingo Molnar @ 2008-05-08 22:55 ` Linus Torvalds 2008-05-08 23:07 ` Linus Torvalds 2008-05-08 23:16 ` Alan Cox 1 sibling, 2 replies; 140+ messages in thread From: Linus Torvalds @ 2008-05-08 22:55 UTC (permalink / raw) To: Ingo Molnar Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Alan Cox On Thu, 8 May 2008, Ingo Molnar wrote: > > 2512 down <= lock_kernel <= opost <= write_chan < > 2574 down <= lock_kernel <= write_chan <= tty_write < Ok. tty write handling. Nasty. But not as nasty as the open/close code, perhaps, and maybe we'll get it fixed some day. In fact, I thought we had fixed most of this already, but hey, I was clearly wrong. I assume Alan looks at it occasionally and groans. Alan? > > some other interesting stats. Top wakeups sources: > > [...] > 1301 default_wake_function <= __wake_up_common <= __wake_up <= n_tty_receive_buf <= pty_write <= write_chan < > 2065 wake_up_state <= prepare_signal <= send_signal <= __group_send_sig_info <= group_send_sig_info <= __kill_pgrp_info < Ok, signals being the top one, but that tty code is pretty high again. > and here's a few seconds worth of NMI driven readprofile output: > > 216021 sync_page 3375.3281 > 391888 page_check_address 1414.7581 > 962212 total 0.3039 > > system overhead is consistently 20% during this test. > > the page_check_address() overhead is surprising - tons of rmap > contention? about 10% wall-clock overhead in that function alone - and > this is just on a dual-core box! No, it's not rmap contention. Your profile hits are just on the actual calculations, and it's all data-dependent arithmetic and loads. Some cache misses on the page tables, clearly, but it looks like a lot of it is even just the plain arithmetic (the imul followed by a data-dependent 'lea' instruction). 
Some of it is that "page_to_pfn(page)", which involves a nasty division (divide by sizeof(struct page)). It gets turned into that shift and multiply, but it's still quite expensive with big constants etc. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) 2008-05-08 22:55 ` Linus Torvalds @ 2008-05-08 23:07 ` Linus Torvalds 2008-05-08 23:14 ` Linus Torvalds 2008-05-08 23:16 ` Alan Cox 1 sibling, 1 reply; 140+ messages in thread From: Linus Torvalds @ 2008-05-08 23:07 UTC (permalink / raw) To: Ingo Molnar Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Alan Cox On Thu, 8 May 2008, Linus Torvalds wrote: > > Some of it is that "page_to_pfn(page)", which involves a nasty division > (divide by sizeof(struct page)). It gets turned into that shift and > multiply, but it's still quite expensive with big constants etc. Btw, sparse will complain about those, because the source code *looks* really cheap. The normal "page_to_pfn()" looks trivial: ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET) which looks like a trivial subtraction and addition of a constant, but the subtraction is on C pointers, and basically turns into ((unsigned long)page - (unsigned long)mem_map) / sizeof(struct page) and because "struct page" is not some nice power-of-two in size, that division is rather nasty even though it's a constant size. Linus ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 23:07 ` Linus Torvalds
@ 2008-05-08 23:14 ` Linus Torvalds
  0 siblings, 0 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08 23:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Alan Cox

On Thu, 8 May 2008, Linus Torvalds wrote:
>
> Btw, sparse will complain about those, because the source code *looks*
> really cheap.

Sometimes you can fix it. For example, this change:

-	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
+	if (pte_present(*pte) && page == pfn_to_page(pte_pfn(*pte))) {

can simplify things: instead of moving from a 'struct page' to a pfn, it moves from a pfn to a 'struct page', and that is generally cheaper (multiply rather than divide by size of struct page). It's not always the same thing to do, but I think in this case we can.

For me, the code generation changes:

-	movabsq	$7905747460161236407, %rdx	#, tmp111
-	movabsq	$32985348833280, %rax	#, tmp107
-	leaq	(%r12,%rax), %rax	#, tmp106
-	sarq	$3, %rax	#, tmp106
-	imulq	%rdx, %rax	# tmp111, tmp106
-	movabsq	$70368744177663, %rdx	#, tmp113
-	andq	%rdx, %rcx	# tmp113, pte$pte
-	shrq	$12, %rcx	#, pte$pte
-	cmpq	%rcx, %rax	# pte$pte, tmp106
+	movabsq	$70368744177663, %rax	#, tmp107
+	andq	%rax, %rdx	# tmp107, pte$pte
+	shrq	$12, %rdx	#, pte$pte
+	imulq	$56, %rdx, %rax	#, pte$pte, tmp109
+	movabsq	$-32985348833280, %rdx	#, tmp111
+	addq	%rdx, %rax	# tmp111, tmp110
+	cmpq	%rax, %r13	# tmp110, page

which isn't a *huge* deal, but it certainly looks better. One less big constant, and one less shift. It's not going to make a huge difference, though. That function is just called too much, and it would still be entirely data-dependent all the way through.

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) 2008-05-08 22:55 ` Linus Torvalds 2008-05-08 23:07 ` Linus Torvalds @ 2008-05-08 23:16 ` Alan Cox 2008-05-08 23:33 ` Linus Torvalds 1 sibling, 1 reply; 140+ messages in thread From: Alan Cox @ 2008-05-08 23:16 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin > In fact, I thought we had fixed most of this already, but hey, I was > clearly wrong. I assume Alan looks at it occasionally and groans. Alan? I have pushed it down to n_tty line discipline code but not within that. It is on the hit list but I'm working on more pressing stuff first (USB layer, extracting commonality to start to tackle open etc etc) I don't think fixing n_tty is now a big job if someone wants to take a swing at it. The driver write/throttle/etc routines below the n_tty ldisc layer are now BKL clean so it should just be the internal locking of the buffers, window and the like to tackle. Feel free to have a go 8) ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 23:16 ` Alan Cox
@ 2008-05-08 23:33 ` Linus Torvalds
  2008-05-08 23:27 ` Alan Cox
  ` (2 more replies)
  0 siblings, 3 replies; 140+ messages in thread
From: Linus Torvalds @ 2008-05-08 23:33 UTC (permalink / raw)
To: Alan Cox
Cc: Ingo Molnar, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Fri, 9 May 2008, Alan Cox wrote:
>
> I don't think fixing n_tty is now a big job if someone wants to take a
> swing at it. The driver write/throttle/etc routines below the n_tty ldisc
> layer are now BKL clean so it should just be the internal locking of the
> buffers, window and the like to tackle.

Well, it turns out that Ingo's fixed statistics actually put the real cost in fcntl/ioctl/open/release:

	 310 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
	 332 down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
	 380 down <= lock_kernel <= tty_release <= __fput <
	 422 down <= lock_kernel <= chrdev_open <= __dentry_open <

rather than the write routines. But it may be that Ingo was just profiling two different sections, and it's really all of them.

		Linus

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) 2008-05-08 23:33 ` Linus Torvalds @ 2008-05-08 23:27 ` Alan Cox 2008-05-09 6:50 ` Ingo Molnar 2008-05-09 8:29 ` Andi Kleen 2 siblings, 0 replies; 140+ messages in thread From: Alan Cox @ 2008-05-08 23:27 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin > 380 down <= lock_kernel <= tty_release <= __fput < > 422 down <= lock_kernel <= chrdev_open <= __dentry_open < > > rather than the write routines. But it may be that Ingo was just profiling > two different sections, and it's really all of them. tty release is probably a few months away from getting cured - I'm afraid it will almost certainly be the very last user of the BKL in tty to get fixed as it depends on everything else being sanely locked. ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) 2008-05-08 23:33 ` Linus Torvalds 2008-05-08 23:27 ` Alan Cox @ 2008-05-09 6:50 ` Ingo Molnar 2008-05-09 8:29 ` Andi Kleen 2 siblings, 0 replies; 140+ messages in thread From: Ingo Molnar @ 2008-05-09 6:50 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I don't think fixing n_tty is now a big job if someone wants to take > > a swing at it. The driver write/throttle/etc routines below the > > n_tty ldisc layer are now BKL clean so it should just be the > > internal locking of the buffers, window and the like to tackle. > > Well, it turns out that Ingo's fixed statistics actually put the real > cost in fcntl/ioctl/open/release: > > 310 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs < > 332 down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl < > 380 down <= lock_kernel <= tty_release <= __fput < > 422 down <= lock_kernel <= chrdev_open <= __dentry_open < > > rather than the write routines. But it may be that Ingo was just > profiling two different sections, and it's really all of them. the first trace had general desktop load mixed into it as well - so while it's not interesting to AIM7 the BKL does matter in those situations and i'd not be surprised if it was responsible for certain categories of desktop lag. The second trace was the correct 'pure' AIM7 workload which produces very little tty output. It is a quite stable workload and the trace i uploaded is representative of the totality of that workload. AIM7 runs for several minutes so there's no significant rampup/rampdown interaction either. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 23:33 ` Linus Torvalds
  2008-05-08 23:27 ` Alan Cox
  2008-05-09 6:50 ` Ingo Molnar
@ 2008-05-09 8:29 ` Andi Kleen
  2 siblings, 0 replies; 140+ messages in thread
From: Andi Kleen @ 2008-05-09 8:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Ingo Molnar, Zhang, Yanmin, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> Well, it turns out that Ingo's fixed statistics actually put the real cost
> in fcntl/ioctl/open/release:
>
>	 310 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <

That must be ->fasync? If it was file locks, the lock_kernel would not be inlined into sys_fcntl. Or is that a truncated backtrace?

-Andi (wondering if he should plug ->fasync_unlocked again ...)

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 12:01 ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Ingo Molnar
  2008-05-08 12:28 ` Ingo Molnar
@ 2008-05-08 13:20 ` Matthew Wilcox
  2008-05-08 15:01 ` Ingo Molnar
  2008-05-08 13:56 ` Arjan van de Ven
  2 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2008-05-08 13:20 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Thu, May 08, 2008 at 02:01:30PM +0200, Ingo Molnar wrote:
> Looking at the workload i found and fixed what i believe to be the real
> bug causing the AIM7 regression: it was inefficient wakeup / scheduling
> / locking behavior of the new generic semaphore code, causing suboptimal
> performance.

I did note that earlier downthread ... although to be fair, I thought of it in terms of three tasks, with the third task coming in and stealing the second task's wakeup, rather than the first task starving the second by repeatedly locking/unlocking the semaphore.

> So if the old owner, even if just a few instructions later, does a
> down() [lock_kernel()] again, it will be blocked and will have to wait
> on the new owner to eventually be scheduled (possibly on another CPU)!
> Or if another task gets to lock_kernel() sooner than the "new owner"
> is scheduled, it will be blocked unnecessarily and for a very long time
> when there are 2000 tasks running.
>
> I.e. the implementation of the new semaphores code does wake-one and
> lock ownership in a very restrictive way - it does not allow
> opportunistic re-locking of the lock at all and keeps the scheduler from
> picking task order intelligently.

Fair is certainly the enemy of throughput (see also dbench arguments passim). It may be that some semaphore users really do want fairness -- it seems pretty clear that we don't want fairness for the BKL.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 13:20 ` Matthew Wilcox
@ 2008-05-08 15:01 ` Ingo Molnar
  0 siblings, 0 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 15:01 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra

* Matthew Wilcox <matthew@wil.cx> wrote:

> Fair is certainly the enemy of throughput (see also dbench arguments
> passim). It may be that some semaphore users really do want fairness
> -- it seems pretty clear that we don't want fairness for the BKL.

i don't think we need to consider any theoretical arguments about fairness here, as there's a fundamental down-to-earth maintenance issue that governs: the old semaphores were similarly unfair, so it is simply a bad idea (and a bug) to change behavior when implementing new, generic semaphores that are supposed to be a seamless replacement! This is about legacy code that is intended to be phased out anyway. That is already a killer argument and we wouldn't have to look any further.

but even on the more theoretical level i disagree: fairness of CPU time is something that is implemented by the scheduler in a natural way already. Putting extra ad-hoc synchronization and scheduling into the locking primitives around data structures only gives mathematical fairness and artificial micro-scheduling; it does not actually make the end result more useful! This is especially true for the BKL, which is auto-dropped by the scheduler anyway (so descheduling a task automatically relieves it of its BKL ownership).

For example, we've invested a _lot_ of time and effort into adding lock stealing (i.e. intentional "unfairness") to kernel/rtmutex.c. That is a _lot_ harder to do atomically with PI constraints, but it is still possible and makes sense in the grand scheme of things.

kernel/mutex.c is also "unfair" - and that's correct IMO. For the BKL in particular there's almost no sense in talking about any underlying resource, and there's almost no expectation from users for that imaginary resource to be shared fairly.

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
  2008-05-08 12:01 ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Ingo Molnar
  2008-05-08 12:28 ` Ingo Molnar
  2008-05-08 13:20 ` Matthew Wilcox
@ 2008-05-08 13:56 ` Arjan van de Ven
  2 siblings, 0 replies; 140+ messages in thread
From: Arjan van de Ven @ 2008-05-08 13:56 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Thu, 8 May 2008 14:01:30 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>
> The contention comes from the following property of the new semaphore
> code: the new owner owns the semaphore exclusively, even if it is not
> running yet.
>
> So if the old owner, even if just a few instructions later, does a
> down() [lock_kernel()] again, it will be blocked and will have to
> wait on the new owner to eventually be scheduled (possibly on another
> CPU)! Or if another task gets to lock_kernel() sooner than the
> "new owner" is scheduled, it will be blocked unnecessarily and for a
> very long time when there are 2000 tasks running.

ok, sounds like I like the fairness part of the new semaphores (but obviously not the 67% performance downside; I'd expect to sacrifice a little performance... but this much??).

It sucks though; if this were a mutex, we could wake up the owner of the bugger in the contended acquire path synchronously.... but these are semaphores, and they don't have an owner ;(

bah bah bah

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08 2:44 ` Zhang, Yanmin
  2008-05-08 3:29 ` Linus Torvalds
@ 2008-05-08 6:43 ` Ingo Molnar
  2008-05-08 6:48 ` Andrew Morton
  2008-05-08 7:14 ` Zhang, Yanmin
  1 sibling, 2 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 6:43 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton

* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> > Here's a trial balloon patch to do that.
> >
> > Yanmin - this is not well tested, but the code is fairly obvious,
> > and it would be interesting to hear if this fixes the performance
> > regression. Because if it doesn't, then it's not the BKL, or
> > something totally different is going on.
>
> Congratulations! The patch really fixes the regression completely!
> vmstat showed cpu idle is 0%, just like 2.6.25's.

great! Yanmin, could you please also check the other patch i sent (also attached below), does it solve the regression similarly?

	Ingo

---
 lib/kernel_lock.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux/lib/kernel_lock.c
===================================================================
--- linux.orig/lib/kernel_lock.c
+++ linux/lib/kernel_lock.c
@@ -46,7 +46,8 @@ int __lockfunc __reacquire_kernel_lock(v
 	task->lock_depth = -1;
 	preempt_enable_no_resched();
 
-	down(&kernel_sem);
+	while (down_trylock(&kernel_sem))
+		cpu_relax();
 
 	preempt_disable();
 	task->lock_depth = saved_lock_depth;
@@ -67,11 +68,13 @@ void __lockfunc lock_kernel(void)
 	struct task_struct *task = current;
 	int depth = task->lock_depth + 1;
 
-	if (likely(!depth))
+	if (likely(!depth)) {
 		/*
 		 * No recursion worries - we set up lock_depth _after_
 		 */
-		down(&kernel_sem);
+		while (down_trylock(&kernel_sem))
+			cpu_relax();
+	}
 
 	task->lock_depth = depth;
 }

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08 6:43 ` AIM7 40% regression with 2.6.26-rc1 Ingo Molnar
@ 2008-05-08 6:48 ` Andrew Morton
  2008-05-08 7:14 ` Zhang, Yanmin
  1 sibling, 0 replies; 140+ messages in thread
From: Andrew Morton @ 2008-05-08 6:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: Zhang, Yanmin, Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro

On Thu, 8 May 2008 08:43:40 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> great! Yanmin, could you please also check the other patch i sent (also
> attached below), does it solve the regression similarly?
>
> 	Ingo
>
> ---
>  lib/kernel_lock.c | 9 ++++++---

but but but. Some other users of down() have presumably also regressed. We just haven't found the workload to demonstrate that yet.

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-08 6:43 ` AIM7 40% regression with 2.6.26-rc1 Ingo Molnar 2008-05-08 6:48 ` Andrew Morton @ 2008-05-08 7:14 ` Zhang, Yanmin 2008-05-08 7:39 ` Ingo Molnar 1 sibling, 1 reply; 140+ messages in thread From: Zhang, Yanmin @ 2008-05-08 7:14 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton On Thu, 2008-05-08 at 08:43 +0200, Ingo Molnar wrote: > * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > > > > Here's a trial balloon patch to do that. > > > > > > Yanmin - this is not well tested, but the code is fairly obvious, > > > and it would be interesting to hear if this fixes the performance > > > regression. Because if it doesn't, then it's not the BKL, or > > > something totally different is going on. > > > > Congratulations! The patch really fixes the regression completely! > > vmstat showed cpu idle is 0%, just like 2.6.25's. > > great! Yanmin, could you please also check the other patch i sent (also > attached below), does it solve the regression similarly? With your patch, aim7 regression becomes less than 2%. I ran the testing twice. Linus' patch could recover it completely. As aim7 result is quite stable(usually fluctuating less than 1%), 1.5%~2% is a little big. 
> > Ingo > > --- > lib/kernel_lock.c | 9 ++++++--- > 1 file changed, 6 insertions(+), 3 deletions(-) > > Index: linux/lib/kernel_lock.c > =================================================================== > --- linux.orig/lib/kernel_lock.c > +++ linux/lib/kernel_lock.c > @@ -46,7 +46,8 @@ int __lockfunc __reacquire_kernel_lock(v > task->lock_depth = -1; > preempt_enable_no_resched(); > > - down(&kernel_sem); > + while (down_trylock(&kernel_sem)) > + cpu_relax(); > > preempt_disable(); > task->lock_depth = saved_lock_depth; > @@ -67,11 +68,13 @@ void __lockfunc lock_kernel(void) > struct task_struct *task = current; > int depth = task->lock_depth + 1; > > - if (likely(!depth)) > + if (likely(!depth)) { > /* > * No recursion worries - we set up lock_depth _after_ > */ > - down(&kernel_sem); > + while (down_trylock(&kernel_sem)) > + cpu_relax(); > + } > > task->lock_depth = depth; > } ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-08 7:14 ` Zhang, Yanmin @ 2008-05-08 7:39 ` Ingo Molnar 2008-05-08 8:44 ` Zhang, Yanmin 0 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-08 7:39 UTC (permalink / raw) To: Zhang, Yanmin Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > > great! Yanmin, could you please also check the other patch i sent > > (also attached below), does it solve the regression similarly? > > With your patch, aim7 regression becomes less than 2%. I ran the > testing twice. > > Linus' patch could recover it completely. As aim7 result is quite > stable(usually fluctuating less than 1%), 1.5%~2% is a little big. is this the old original aim7 you are running, or osdl-aim-7 or re-aim-7? if it's aim7 then this is a workload that starts+stops 2000 parallel tasks that each start and exit at the same time. That might explain its sensitivity on the BKL - this is all about tty-controlled task startup and exit. i could not get it to produce anywhere close to stable results though. I also frequently get into this problem: AIM Multiuser Benchmark - Suite VII Run Beginning Tasks jobs/min jti jobs/min/task real cpu 2000 Failed to execute new_raph 200 Unable to solve equation in 100 tries. P = 1.5708, P0 = 1.5708, delta = 6.12574e-17 Failed to execute disk_cp /mnt/shm disk_cp (1): cannot open /mnt/shm/tmpa.common disk1.c: No such file or directory [.. etc. a large stream of them .. ] system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its work files. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08 7:39 ` Ingo Molnar
@ 2008-05-08 8:44 ` Zhang, Yanmin
  2008-05-08 9:21 ` Ingo Molnar
  0 siblings, 1 reply; 140+ messages in thread
From: Zhang, Yanmin @ 2008-05-08 8:44 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 2127 bytes --]

On Thu, 2008-05-08 at 09:39 +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
>
> > > great! Yanmin, could you please also check the other patch i sent
> > > (also attached below), does it solve the regression similarly?
> >
> > With your patch, aim7 regression becomes less than 2%. I ran the
> > testing twice.
> >
> > Linus' patch could recover it completely. As aim7 result is quite
> > stable (usually fluctuating less than 1%), 1.5%~2% is a little big.
>
> is this the old original aim7 you are running,

I use the old AIM7 plus a small patch which just changes a couple of data types to match 64-bit.

> or osdl-aim-7 or
> re-aim-7?
>
> if it's aim7 then this is a workload that starts+stops 2000 parallel
> tasks that each start and exit at the same time.

Yes.

> That might explain its
> sensitivity on the BKL - this is all about tty-controlled task startup
> and exit.
>
> i could not get it to produce anywhere close to stable results though. I
> also frequently get into this problem:
>
> AIM Multiuser Benchmark - Suite VII Run Beginning
> Tasks jobs/min jti jobs/min/task real cpu
> 2000
> Failed to execute
> new_raph 200
> Unable to solve equation in 100 tries. P = 1.5708, P0 = 1.5708, delta = 6.12574e-17
>
> Failed to execute
> disk_cp /mnt/shm
> disk_cp (1): cannot open /mnt/shm/tmpa.common
> disk1.c: No such file or directory
>
> [.. etc. a large stream of them .. ]
>
> system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its
> work files.

My machine has 8GB. To simulate your environment, I reserved 6GB for hugetlb, then reran the testing and didn't see any failure except:

AIM Multiuser Benchmark - Suite VII Run Beginning
Tasks jobs/min jti jobs/min/task real cpu
2000
create_shared_memory(): can't create semaphore, pausing...
create_shared_memory(): can't create semaphore, pausing...

The above info doesn't mean errors. Perhaps you could:
1) Apply the attached aim9 patch;
2) check that you have write permission under /mnt/shm;
3) echo "/mnt/shm" > aim7_path/config;

[-- Attachment #2: aim9.diff --]
[-- Type: text/x-patch, Size: 2481 bytes --]

diff -urp aim9.orig/creat-clo.c aim9/creat-clo.c
--- aim9.orig/creat-clo.c	2002-04-22 15:25:16.000000000 -0700
+++ aim9/creat-clo.c	2005-07-11 10:20:13.000000000 -0700
@@ -352,7 +352,7 @@ page_test(char *argv,
 	 */
 	oldbrk = sbrk(0);		/* get current break value */
 	newbrk = sbrk(1024 * 1024);	/* move up 1 megabyte */
-	if ((int)newbrk == -1) {
+	if (newbrk == (void *)-1L) {
 		perror("\npage_test");	/* tell more info */
 		fprintf(stderr, "page_test: Unable to do initial sbrk.\n");
 		return (-1);
@@ -365,7 +365,7 @@ page_test(char *argv,
 	newbrk = sbrk(-4096 * 16);	/* deallocate some space */
 	for (i = 0; i < 16; i++) {	/* now get it back in pieces */
 		newbrk = sbrk(4096);	/* Get pointer to new space */
-		if ((int)newbrk == -1) {
+		if (newbrk == (void *)-1L) {
 			perror("\npage_test");	/* tell more info */
 			fprintf(stderr,
 				"page_test: Unable to do sbrk.\n");
@@ -406,7 +406,7 @@ brk_test(char *argv,
 	 */
 	oldbrk = sbrk(0);		/* get old break value */
 	newbrk = sbrk(1024 * 1024);	/* move up 1 megabyte */
-	if ((int)newbrk == -1) {
+	if (newbrk == (void *)-1L) {
 		perror("\nbrk_test");	/* tell more info */
 		fprintf(stderr, "brk_test: Unable to do initial sbrk.\n");
 		return (-1);
@@ -419,7 +419,7 @@ brk_test(char *argv,
 	newbrk = sbrk(-4096 * 16);	/* deallocate some space */
 	for (i = 0; i < 16; i++) {	/* allocate it back */
 		newbrk = sbrk(4096);	/* 4k at a time (should be ~ 1 page) */
-		if ((int)newbrk == -1) {
+		if (newbrk == (void *)-1L) {
 			perror("\nbrk_test");	/* tell more info */
 			fprintf(stderr,
 				"brk_test: Unable to do sbrk.\n");
diff -urp aim9.orig/pipe_test.c aim9/pipe_test.c
--- aim9.orig/pipe_test.c	2002-04-22 15:25:16.000000000 -0700
+++ aim9/pipe_test.c	2005-07-11 10:21:19.000000000 -0700
@@ -493,8 +493,8 @@ readn(int fd,
 		buf += result;		/* update pointer */
 		if (--count <= 0) {
 			fprintf(stderr,
-				"\nMaximum iterations exceeded in readn(%d, %#x, %d)",
-				fd, (unsigned)buf, size);
+				"\nMaximum iterations exceeded in readn(%d, %p, %d)",
+				fd, buf, size);
 			return (-1);
 		}
 	}			/* and loop */
@@ -523,8 +523,8 @@ writen(int fd,
 		buf += result;		/* update pointer */
 		if (--count <= 0) {	/* handle too many loops */
 			fprintf(stderr,
-				"\nMaximum iterations exceeded in writen(%d, %#x, %d)",
-				fd, (unsigned)buf, size);
+				"\nMaximum iterations exceeded in writen(%d, %p, %d)",
+				fd, buf, size);
 			return (-1);
 		}
 	}			/* and loop */

^ permalink raw reply	[flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
  2008-05-08 8:44 ` Zhang, Yanmin
@ 2008-05-08 9:21 ` Ingo Molnar
  2008-05-08 9:29 ` Ingo Molnar
  2008-05-08 9:30 ` Zhang, Yanmin
  0 siblings, 2 replies; 140+ messages in thread
From: Ingo Molnar @ 2008-05-08 9:21 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton

* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> > disk_cp /mnt/shm
> > disk_cp (1): cannot open /mnt/shm/tmpa.common
> > disk1.c: No such file or directory
> >
> > [.. etc. a large stream of them .. ]
> >
> > system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its
> > work files.
>
> My machine has 8GB. To simulate your environment, I reserve 6GB for
> hugetlb, then reran the testing and didn't see any failure except:
>
> AIM Multiuser Benchmark - Suite VII Run Beginning
> Tasks jobs/min jti jobs/min/task real cpu
> 2000
> create_shared_memory(): can't create semaphore, pausing...
> create_shared_memory(): can't create semaphore, pausing...

that failure message you got worries me - it indicates that your test ran out of IPC semaphores. You can fix it by upping the semaphore limits via:

	echo "500 32000 128 512" > /proc/sys/kernel/sem

could you check that you still get similar results with this limit fixed?

note that once i fixed the semaphore limits it started running fine here. And i see zero idle time during the run on a quad-core box.

here are my numbers:

# on v2.6.26-rc1-166-gc0a1811

Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
2000	55851.4		93	208.4	793.6	0.4654		# BKL: sleep
2000	55402.2		79	210.1	800.1	0.4617

2000	55728.4		93	208.9	795.5	0.4644		# BKL: spin
2000	55787.2		93	208.7	794.5	0.4649		#

so the results are the same within noise.

I'll also check this workload on an 8-way box to make sure it's OK on larger CPU counts too.

could you double-check your test?

plus a tty tidbit as well, during the test i saw a few of these:

Warning: dev (tty1) tty->count(639) != #fd's(638) in release_dev
Warning: dev (tty1) tty->count(462) != #fd's(463) in release_dev
Warning: dev (tty1) tty->count(274) != #fd's(275) in release_dev
Warning: dev (tty1) tty->count(4) != #fd's(3) in release_dev
Warning: dev (tty1) tty->count(164) != #fd's(163) in release_dev

	Ingo

^ permalink raw reply	[flat|nested] 140+ messages in thread
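For reference, the four numbers Ingo writes into `/proc/sys/kernel/sem` are the System V IPC limits SEMMSL, SEMMNS, SEMOPM and SEMMNI; a quick way to inspect the current values on a Linux box (an illustrative session, not from the thread):

```shell
# /proc/sys/kernel/sem holds four fields: SEMMSL (max semaphores per set),
# SEMMNS (system-wide max semaphores), SEMOPM (max ops per semop() call),
# SEMMNI (max number of semaphore sets).
awk '{ printf "SEMMSL=%s SEMMNS=%s SEMOPM=%s SEMMNI=%s\n", $1, $2, $3, $4 }' /proc/sys/kernel/sem

# Raising the limits (as root) is a single write, as in the mail:
#   echo "500 32000 128 512" > /proc/sys/kernel/sem
```

AIM7's `create_shared_memory()` failure above is what running into the SEMMNI/SEMMNS ceiling looks like with 2000 tasks each grabbing a semaphore set.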
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-08 9:21 ` Ingo Molnar @ 2008-05-08 9:29 ` Ingo Molnar 2008-05-08 9:30 ` Zhang, Yanmin 1 sibling, 0 replies; 140+ messages in thread From: Ingo Molnar @ 2008-05-08 9:29 UTC (permalink / raw) To: Zhang, Yanmin Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton * Ingo Molnar <mingo@elte.hu> wrote: > plus a tty tidbit as well, during the test i saw a few of these: > > Warning: dev (tty1) tty->count(639) != #fd's(638) in release_dev false alarm there - these were due to the breakage in the hack-patch i used ... Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-08 9:21 ` Ingo Molnar 2008-05-08 9:29 ` Ingo Molnar @ 2008-05-08 9:30 ` Zhang, Yanmin 1 sibling, 0 replies; 140+ messages in thread From: Zhang, Yanmin @ 2008-05-08 9:30 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton On Thu, 2008-05-08 at 11:21 +0200, Ingo Molnar wrote: > * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > > > > disk_cp /mnt/shm > > > disk_cp (1): cannot open /mnt/shm/tmpa.common > > > disk1.c: No such file or directory > > > > > > [.. etc. a large stream of them .. ] > > > > > > system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its > > > work files. > > > My machine has 8GB. To simulate your environment, I reserve 6GB for > > hugetlb, then reran the testing and didn't see any failure except: AIM > > Multiuser Benchmark - Suite VII Run Beginning > > > > Tasks jobs/min jti jobs/min/task real cpu > > 2000create_shared_memory(): can't create semaphore, pausing... > > create_shared_memory(): can't create semaphore, pausing... > > that failure message you got worries me - it indicates that your test > ran out of IPC semaphores. You can fix it via upping the semaphore > limits via: > > echo "500 32000 128 512" > /proc/sys/kernel/sem A quick test showed it does work. Thanks. I need to take shuttle bus or I need walk to home for 2 hours if missing it. :) > > could you check that you still get similar results with this limit > fixed? > > note that once i've fixed the semaphore limits it started running fine > here. And i see zero idle time during the run on a quad core box. > > here are my numbers: > > # on v2.6.26-rc1-166-gc0a1811 > > Tasks Jobs/Min JTI Real CPU Jobs/sec/task > 2000 55851.4 93 208.4 793.6 0.4654 # BKL: sleep > 2000 55402.2 79 210.1 800.1 0.4617 > > 2000 55728.4 93 208.9 795.5 0.4644 # BKL: spin > 2000 55787.2 93 208.7 794.5 0.4649 # > > so the results are the same within noise. 
> > I'll also check this workload on an 8-way box to make sure it's OK on > larger CPU counts too. > > could you double-check your test? > > plus a tty tidbit as well, during the test i saw a few of these: > > Warning: dev (tty1) tty->count(639) != #fd's(638) in release_dev > Warning: dev (tty1) tty->count(462) != #fd's(463) in release_dev > Warning: dev (tty1) tty->count(274) != #fd's(275) in release_dev > Warning: dev (tty1) tty->count(4) != #fd's(3) in release_dev > Warning: dev (tty1) tty->count(164) != #fd's(163) in release_dev > > Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1 2008-05-07 14:36 ` Linus Torvalds ` (2 preceding siblings ...) 2008-05-07 15:19 ` Linus Torvalds @ 2008-05-07 16:20 ` Ingo Molnar 2008-05-07 16:35 ` Linus Torvalds 3 siblings, 1 reply; 140+ messages in thread From: Ingo Molnar @ 2008-05-07 16:20 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton * Linus Torvalds <torvalds@linux-foundation.org> wrote: > I think turning the BKL into a semaphore was fine per se, but that was > when semaphores were fast. hm, do we know it for a fact that the 40% AIM regression is due to the fastpath overhead of the BKL? It would be extraordinary if so. I think it is far more likely that it's due to the different scheduling and wakeup behavior of the new kernel/semaphore.c code. So the fix would be to restore the old scheduling behavior - that's what Yanmin's manual revert did and that's what got him back the previous AIM7 performance. Ingo ^ permalink raw reply [flat|nested] 140+ messages in thread
* Re: AIM7 40% regression with 2.6.26-rc1
From: Linus Torvalds @ 2008-05-07 16:35 UTC
To: Ingo Molnar
Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

On Wed, 7 May 2008, Ingo Molnar wrote:
>
> I think it is far more likely that it's due to the different scheduling
> and wakeup behavior of the new kernel/semaphore.c code. So the fix would
> be to restore the old scheduling behavior - that's what Yanmin's manual
> revert did and that's what got him back the previous AIM7 performance.

Yes, Yanmin's manual revert got rid of the new semaphores entirely. Which
was what, 7500 lines of code removed that got reverted.

And the *WHOLE* and *ONLY* excuse for dropping the spinlock lock_kernel
was this (and I quote your message):

	remove the !PREEMPT_BKL code.

	this removes 160 lines of legacy code.

in other words, your only stated valid reason for getting rid of the
spinlock was 160 lines, and the comment didn't even match what it did (it
removed the spinlocks entirely, not just the preemptible version).

In contrast, the revert adds 7500 lines. If you go by the only documented
reason for the crap that is the current BKL, then I know which one I'll
take. I'll take the spinlock back, and I'd rather put preemption back
than ever take those semaphores.

And even that's ignoring another issue: did anybody ever even do that
AIM7 benchmark comparing spinlocks to the semaphore-BKL? It's quite
possible that the semaphores (even the well-behaved ones) behaved worse
than the spinlocks.

		Linus
* Re: AIM7 40% regression with 2.6.26-rc1
From: Ingo Molnar @ 2008-05-07 17:05 UTC
To: Linus Torvalds
Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > I think it is far more likely that it's due to the different
> > scheduling and wakeup behavior of the new kernel/semaphore.c code.
> > So the fix would be to restore the old scheduling behavior - that's
> > what Yanmin's manual revert did and that's what got him back the
> > previous AIM7 performance.
>
> Yes, Yanmin's manual revert got rid of the new semaphores entirely.
> Which was what, 7500 lines of code removed that got reverted.

i wouldnt advocate a 7500 revert instead of a 160 lines change. my
suggestion was that the scheduling behavior of the new kernel/semaphore.c
code is causing the problem - i.e. making it match the old semaphore
code's behavior would give us back performance.

> And the *WHOLE* and *ONLY* excuse for dropping the spinlock
> lock_kernel was this (and I quote your message):
>
> 	remove the !PREEMPT_BKL code.
>
> 	this removes 160 lines of legacy code.
>
> in other words, your only stated valid reason for getting rid of the
> spinlock was 160 lines, and the comment didn't even match what it did
> (it removed the spinlocks entirely, not just the preemptible version).

it was removed by me in the course of this discussion:

  http://lkml.org/lkml/2008/1/2/58

the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the
spinlock version] was broken for a longer period of time (it crashed
trivially), because nobody apparently used it. People (Nick) asked why it
was still there and i agreed and removed it.

CONFIG_PREEMPT_BKL=y was the default, that was what all distros used.
I.e. the spinlock code was in essence dead code at that point in time.
the spinlock code might in fact perform _better_, but nobody came up with
such a workload before.

> In contrast, the revert adds 7500 lines. If you go by the only
> documented reason for the crap that is the current BKL, then I know
> which one I'll take. I'll take the spinlock back, and I'd rather put
> preemption back than ever take those semaphores.
>
> And even that's ignoring another issue: did anybody ever even do that
> AIM7 benchmark comparing spinlocks to the semaphore-BKL? It's quite
> possible that the semaphores (even the well-behaved ones) behaved
> worse than the spinlocks.

that's a good question...

	Ingo
* Re: AIM7 40% regression with 2.6.26-rc1
From: Linus Torvalds @ 2008-05-07 17:24 UTC
To: Ingo Molnar
Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

On Wed, 7 May 2008, Ingo Molnar wrote:
>
> it was removed by me in the course of this discussion:
>
>   http://lkml.org/lkml/2008/1/2/58
>
> the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the
> spinlock version] was broken for a longer period of time (it crashed
> trivially), because nobody apparently used it.

Hmm. I've generally used PREEMPT_NONE, and always thought PREEMPT_BKL was
the known-flaky one.

The thread you point to also says that it's PREEMPT_BKL=y that was the
problem (ie "I've seen 1s+ desktop latencies due to PREEMPT_BKL when I
was still using reiserfs."), not the plain spinlock approach.

But it would definitely be interesting to see the crash reports. And the
help message always said "Say N if you are unsure." even if it ended up
being marked 'y' by default at some point (and then in January was made
first unconditional, and then removed entirely)

Because in many ways, the non-preempt BKL is the *much* simpler case. I
don't see why it would crash - it just turns the BKL into a trivial
counting spinlock that can sleep.

		Linus
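The "trivial counting spinlock" shape Linus describes can be sketched in userspace. This is only a model of the recursion-counting part: the real kernel kept the depth in task_struct->lock_depth and also released the lock around schedule(). The pthread mutex, the 0-based depth, and the per-thread variable are assumptions of this sketch, not the kernel implementation:

```c
#include <assert.h>
#include <pthread.h>

/* One global "big kernel lock" plus a per-thread recursion depth, so
 * nested lock_kernel() calls on the same thread don't deadlock. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread int bkl_depth;	/* per-"task" recursion count */

void lock_kernel(void)
{
	if (bkl_depth++ == 0)	/* only the first acquisition takes the lock */
		pthread_mutex_lock(&big_lock);
}

void unlock_kernel(void)
{
	if (--bkl_depth == 0)	/* only the last release drops it */
		pthread_mutex_unlock(&big_lock);
}
```

The counting is what makes the BKL recursive; the "can sleep" property came from the scheduler dropping and reacquiring the lock across context switches, which this sketch deliberately omits.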
* Re: AIM7 40% regression with 2.6.26-rc1
From: Ingo Molnar @ 2008-05-07 17:36 UTC
To: Linus Torvalds
Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 7 May 2008, Ingo Molnar wrote:
> >
> > it was removed by me in the course of this discussion:
> >
> >   http://lkml.org/lkml/2008/1/2/58
> >
> > the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the
> > spinlock version] was broken for a longer period of time (it crashed
> > trivially), because nobody apparently used it.
>
> Hmm. I've generally used PREEMPT_NONE, and always thought PREEMPT_BKL
> was the known-flaky one.
>
> The thread you point to also says that it's PREEMPT_BKL=y that was the
> problem (ie "I've seen 1s+ desktop latencies due to PREEMPT_BKL when I
> was still using reiserfs."), not the plain spinlock approach.

no, there was another problem (which i couldnt immediately find because
lkml.org only indexes part of the threads, i'll research it some more),
which was some cond_resched() thing in the !PREEMPT_BKL case.

> But it would definitely be interesting to see the crash reports. And
> the help message always said "Say N if you are unsure." even if it
> ended up being marked 'y' by default at some point (and then in
> January was made first unconditional, and then removed entirely)
>
> Because in many ways, the non-preempt BKL is the *much* simpler case.
> I don't see why it would crash - it just turns the BKL into a trivial
> counting spinlock that can sleep.

yeah. The latencies are a different problem, and indeed were reported
against PREEMPT_BKL, and believed to be due to reiser3 and the tty code.
(reiser3 runs almost all of its code under the BKL)

The !PREEMPT_BKL crash was some simple screwup on my part of getting
atomicity checks wrong in cond_resched() - and it went unnoticed for a
long time - or something like that. I'll try to find that discussion.

	Ingo
* Re: AIM7 40% regression with 2.6.26-rc1
From: Linus Torvalds @ 2008-05-07 17:55 UTC
To: Ingo Molnar
Cc: Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

On Wed, 7 May 2008, Ingo Molnar wrote:
>
> no, there was another problem (which i couldnt immediately find because
> lkml.org only indexes part of the threads, i'll research it some more),
> which was some cond_resched() thing in the !PREEMPT_BKL case.

Hmm. I do agree that _cond_resched() looks a bit iffy, although in a safe
way. It uses just

	!(preempt_count() & PREEMPT_ACTIVE)

to see whether it can schedule, and it should probably use in_atomic()
which ignores the kernel lock.

But right now, that whole thing is disabled if PREEMPT is on anyway, so
in effect (with my test patch, at least) cond_resched() would just be a
no-op if PREEMPT is on, even if BKL isn't preemptable.

So it doesn't look buggy, but it looks like it might cause longer
latencies than strictly necessary. And if somebody depends on
cond_resched() to avoid some bad livelock situation, that would obviously
not work (but that sounds like a fundamental bug anyway, I really hope
nobody has ever written their code that way).

> The !PREEMPT_BKL crash was some simple screwup on my part of getting
> atomicity checks wrong in cond_resched() - and it went unnoticed for a
> long time - or something like that. I'll try to find that discussion.

Yes, some silly bug sounds more likely. Especially considering how many
different cases there were (semaphores vs spinlocks vs preemptable
spinlocks).

		Linus
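The two predicates Linus contrasts can be modeled side by side. This is purely an illustrative model, not kernel source: the PREEMPT_ACTIVE value mirrors the 2.6-era x86 layout but is an assumption here, and the function names are invented for the sketch. The point is simply that the two tests disagree whenever the low (lock/atomic-section) bits of preempt_count are nonzero while PREEMPT_ACTIVE is clear:

```c
#include <assert.h>
#include <stdbool.h>

/* PREEMPT_ACTIVE is a single high bit marking a task that is currently
 * being preempted; the low bits count held spinlocks / preempt-disable
 * nesting. The constant is illustrative (2.6-era x86 used 0x10000000). */
#define PREEMPT_ACTIVE 0x10000000u

/* The check Linus quotes from _cond_resched(): it only refuses to
 * reschedule while the PREEMPT_ACTIVE bit itself is set. */
static bool may_resched_active_only(unsigned int preempt_count)
{
	return !(preempt_count & PREEMPT_ACTIVE);
}

/* An in_atomic()-style check: any nonzero count outside the
 * PREEMPT_ACTIVE bit means we are inside a non-sleepable section. */
static bool may_resched_in_atomic_style(unsigned int preempt_count)
{
	return (preempt_count & ~PREEMPT_ACTIVE) == 0;
}
```

With a spinlock-based BKL under PREEMPT, holding the kernel lock bumps the low bits of preempt_count, so which of these two tests cond_resched() uses determines whether it fires while the BKL is held; that is exactly the ambiguity being debated above.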
* Re: AIM7 40% regression with 2.6.26-rc1
From: Matthew Wilcox @ 2008-05-07 17:59 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Andi Kleen, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

On Wed, May 07, 2008 at 10:55:26AM -0700, Linus Torvalds wrote:
> Hmm. I do agree that _cond_resched() looks a bit iffy, although in a safe
> way. It uses just
>
> 	!(preempt_count() & PREEMPT_ACTIVE)
>
> to see whether it can schedule, and it should probably use in_atomic()
> which ignores the kernel lock.
>
> But right now, that whole thing is disabled if PREEMPT is on anyway, so in
> effect (with my test patch, at least) cond_preempt() would just be a no-op
> if PREEMPT is on, even if BKL isn't preemptable.
>
> So it doesn't look buggy, but it looks like it might cause longer
> latencies than strictly necessary. And if somebody depends on
> cond_resched() to avoid some bad livelock situation, that would obviously
> not work (but that sounds like a fundamental bug anyway, I really hope
> nobody has ever written their code that way).

Funny you should mention it; locks.c uses cond_resched() assuming that it
ignores the BKL. Not through needing to avoid livelock, but it does
presume that other higher priority tasks contending for the lock will get
a chance to take it. You'll notice the patch I posted yesterday drops the
file_lock_lock around the call to cond_resched().

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such a
retrograde step."
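The pattern Matthew describes — drop the lock around the voluntary reschedule point so contenders can actually take it — looks roughly like the userspace sketch below. The function name is invented, the pthread mutex stands in for the kernel's file_lock_lock spinlock, and sched_yield() stands in for cond_resched(); none of this is the actual fs/locks.c patch:

```c
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t file_lock_lock = PTHREAD_MUTEX_INITIALIZER;

/* Walk a long list in chunks, giving contending threads a window to
 * take the lock at the reschedule point. Returns 0 on completion. */
int walk_lock_list_with_resched(void)
{
	pthread_mutex_lock(&file_lock_lock);
	/* ... scan part of the lock list ... */

	pthread_mutex_unlock(&file_lock_lock);	/* let waiters in */
	sched_yield();				/* cond_resched() stand-in */
	pthread_mutex_lock(&file_lock_lock);	/* reacquire, continue */

	/* ... finish the scan; any state cached across the gap must be
	 * revalidated, since the list may have changed meanwhile ... */
	pthread_mutex_unlock(&file_lock_lock);
	return 0;
}
```

The design point is the one in the message: calling a reschedule helper while still holding the lock defeats its purpose, because the higher-priority task that gets scheduled immediately blocks on the very lock you still hold.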
* Re: AIM7 40% regression with 2.6.26-rc1
From: Linus Torvalds @ 2008-05-07 18:17 UTC
To: Matthew Wilcox
Cc: Ingo Molnar, Andi Kleen, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

On Wed, 7 May 2008, Matthew Wilcox wrote:
>
> > So it doesn't look buggy, but it looks like it might cause longer
> > latencies than strictly necessary. And if somebody depends on
> > cond_resched() to avoid some bad livelock situation, that would
> > obviously not work (but that sounds like a fundamental bug anyway, I
> > really hope nobody has ever written their code that way).
>
> Funny you should mention it; locks.c uses cond_resched() assuming that
> it ignores the BKL. Not through needing to avoid livelock, but it does
> presume that other higher priority tasks contending for the lock will
> get a chance to take it. You'll notice the patch I posted yesterday
> drops the file_lock_lock around the call to cond_resched().

Well, this would only be noticeable with CONFIG_PREEMPT. If you don't
have preempt enabled, it looks like everything should work ok: the kernel
lock wouldn't increase the preempt count, and _cond_resched() works fine.

If you're PREEMPT, then the kernel lock would increase the preempt count,
and _cond_resched() would refuse to re-schedule it, *but* with PREEMPT
you'd never see it *anyway*, because PREEMPT will disable cond_resched()
entirely (because preemption takes care of normal scheduling latencies
without it).

And I'm also sure that this all worked fine at some point, and it's
largely a result just of the multiple different variations of BKL
preemption coupled with some of them getting removed entirely, so the
code that used to handle it just got corrupt over time. See commit
02b67cc3b, for example.

.. Hmm ... Time passes. Linus looks at git history.

It does look like "cond_resched()" has not worked with the BKL since
2005, and hasn't taken the BKL into account. Commit 5bbcfd9000:

	[PATCH] cond_resched(): fix bogus might_sleep() warning

	+	if (unlikely(preempt_count()))
	+		return;

which talks about the BKS, ie it only took the *semaphore* implementation
into account. Never the spinlock-with-preemption-count one.

Or am I blind?

		Linus
* Re: AIM7 40% regression with 2.6.26-rc1
From: Ingo Molnar @ 2008-05-07 18:49 UTC
To: Linus Torvalds
Cc: Matthew Wilcox, Andi Kleen, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> .. Hmm ... Time passes. Linus looks at git history.
>
> It does look like "cond_resched()" has not worked with the BKL since
> 2005, and hasn't taken the BKL into account. Commit 5bbcfd9000:
>
> 	[PATCH] cond_resched(): fix bogus might_sleep() warning
>
> 	+	if (unlikely(preempt_count()))
> 	+		return;
>
> which talks about the BKS, ie it only took the *semaphore*
> implementation into account. Never the spinlock-with-preemption-count
> one.
>
> Or am I blind?

hm, i think you are right. most latency reduction was concentrated on the
PREEMPT+PREEMPT_BKL case, and not getting proper cond_resched() behavior
in case of !PREEMPT_BKL would certainly not be noticed by distros or
users.

We made CONFIG_PREEMPT_BKL=y the default on SMP in v2.6.8, in this
post-2.6.7 commit that introduced the feature:

| commit fb8f6499abc6a847109d9602b797aa6afd2d5a3d
| Author: Ingo Molnar <mingo@elte.hu>
| Date:   Fri Jan 7 21:59:57 2005 -0800
|
|     [PATCH] remove the BKL by turning it into a semaphore

There was constant trouble around all these variations of preemptability
and their combination with debugging helpers. (So i was rather happy to
get rid of !PREEMPT_BKL - in the (apparently wrong) assumption that no
tears will be shed.)

	Ingo
* Re: AIM7 40% regression with 2.6.26-rc1
From: Alan Cox @ 2008-05-07 13:59 UTC
To: Andi Kleen
Cc: Zhang, Yanmin, Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton

On Wed, 07 May 2008 13:00:14 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes:
> > 3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
>
> I have an older patchkit that introduced unlocked_fcntl for some cases.
> It was briefly in mm but then dropped. Sounds like it is worth
> resurrecting?
>
> tty_* is being taken care of by Alan.

The tty open/close paths are probably a few months away from dropping the
BKL.

Alan