On Thu, Jun 27, 2013 at 5:54 PM, Dave Chinner wrote: > On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote: >> >> So what made it all start happening now? I don't recall us having had >> these kinds of issues before.. > > Not sure - it's a sudden surprise for me, too. Then again, I haven't > been looking at sync from a performance or lock contention point of > view any time recently. The algorithm that wait_sb_inodes() is > effectively unchanged since at least 2009, so it's probably a case > of it having been protected from contention by some external factor > we've fixed/removed recently. Perhaps the bdi-flusher thread > replacement in -rc1 has changed the timing sufficiently that it no > longer serialises concurrent sync calls as much.... > > However, the inode_sb_list_lock is known to be a badly contended > lock from a create/unlink fastpath for XFS, so it's not like this sort > of thing is completely unexpected. That whole inode_sb_list_lock seems moronic. Why isn't it a per-sb one? No, that won't fix all problems, but it might at least help a *bit*. Also, looking some more now at that wait_sb_inodes logic, I have to say that if the problem is primarily the inode->i_lock, then that's just crazy. We normally shouldn't even *need* that lock, since we could do a totally unlocked iget() as long as the count is non-zero. And no, I don't think really need the i_lock for checking "mapping->nrpages == 0" or the magical "inode is being freed" bits either. Or at least we could easily do some of this optimistically for the common cases. So instead of doing struct address_space *mapping = inode->i_mapping; spin_lock(&inode->i_lock); if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || (mapping->nrpages == 0)) { spin_unlock(&inode->i_lock); continue; } __iget(inode); spin_unlock(&inode->i_lock); I really think we could do that without getting the inode lock at *all* in the common case. I'm attaching a pretty trivial patch, which may obviously be trivially totally flawed. I have not tested this in any way, but half the new lines are comments about why it's doing what it is doing. And I really think that it should make the "actually take the inode lock" be something quite rare. And quite frankly, I'd much rather get *rid* of crazy i_lock accesses, than try to be clever and use a whole different list at this point. Not that I disagree that it wouldn't be much nicer to use a separate list in the long run, but for a short-term solution I'd much rather keep the old logic and just tweak it to be much more usable.. Hmm? Al? Jan? Comments? Linus