All of lore.kernel.org
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Dave Chinner <david@fromorbit.com>,
	Al Viro <viro@zeniv.linux.org.uk>, Jan Kara <jack@suse.cz>
Cc: Dave Jones <davej@redhat.com>, Oleg Nesterov <oleg@redhat.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Andrey Vagin <avagin@openvz.org>,
	Steven Rostedt <rostedt@goodmis.org>
Subject: Re: frequent softlockups with 3.10rc6.
Date: Thu, 27 Jun 2013 19:59:50 -1000	[thread overview]
Message-ID: <CA+55aFyZYsbMpP+6dkdkhdDn9gpTx0dkv25MUtcnswer_a2x9w@mail.gmail.com> (raw)
In-Reply-To: <20130628035437.GB29338@dastard>

[-- Attachment #1: Type: text/plain, Size: 2817 bytes --]

On Thu, Jun 27, 2013 at 5:54 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote:
>>
>> So what made it all start happening now? I don't recall us having had
>> these kinds of issues before..
>
> Not sure - it's a sudden surprise for me, too. Then again, I haven't
> been looking at sync from a performance or lock contention point of
> view any time recently.  The algorithm that wait_sb_inodes() is
> effectively unchanged since at least 2009, so it's probably a case
> of it having been protected from contention by some external factor
> we've fixed/removed recently.  Perhaps the bdi-flusher thread
> replacement in -rc1 has changed the timing sufficiently that it no
> longer serialises concurrent sync calls as much....
>
> However, the inode_sb_list_lock is known to be a badly contended
> lock from a create/unlink fastpath for XFS, so it's not like this sort
> of thing is completely unexpected.

That whole inode_sb_list_lock seems moronic. Why isn't it a per-sb
one? No, that won't fix all problems, but it might at least help a
*bit*.

Also, looking some more now at that wait_sb_inodes logic, I have to
say that if the problem is primarily the inode->i_lock, then that's
just crazy. We normally shouldn't even *need* that lock, since we
could do a totally unlocked iget() as long as the count is non-zero.

And no, I don't think really need the i_lock for checking
"mapping->nrpages == 0" or the magical "inode is being freed" bits
either. Or at least we could easily do some of this optimistically for
the common cases.

So instead of doing

                struct address_space *mapping = inode->i_mapping;

                spin_lock(&inode->i_lock);
                if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
                    (mapping->nrpages == 0)) {
                        spin_unlock(&inode->i_lock);
                        continue;
                }
                __iget(inode);
                spin_unlock(&inode->i_lock);

I really think we could do that without getting the inode lock at
*all* in the common case.

I'm attaching a pretty trivial patch, which may obviously be trivially
totally flawed. I have not tested this in any way, but half the new
lines are comments about why it's doing what it is doing.  And I
really think that it should make the "actually take the inode lock" be
something quite rare.

And quite frankly, I'd much rather get *rid* of crazy i_lock accesses,
than try to be clever and use a whole different list at this point.
Not that I disagree that it wouldn't be much nicer to use a separate
list in the long run, but for a short-term solution I'd much rather
keep the old logic and just tweak it to be much more usable..

Hmm? Al? Jan? Comments?

                             Linus

[-- Attachment #2: patch.diff --]
[-- Type: application/octet-stream, Size: 2358 bytes --]

 fs/fs-writeback.c | 58 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 48 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be57189efd5..3dcc8b202a40 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1206,6 +1206,52 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
+/*
+ * Do we want to get the inode for writeback?
+ */
+static int get_inode_for_writeback(struct inode *inode)
+{
+	struct address_space *mapping = inode->i_mapping;
+
+	/*
+	 * It's a data integrity sync, but we don't care about
+	 * racing with new pages - we're about data integrity
+	 * of things in the past, not the future
+	 */
+	if (!ACCESS_ONCE(mapping->nrpages))
+		return 0;
+
+	/* Similar logic wrt the I_NEW bit */
+	if (ACCESS_ONCE(inode->i_state) & I_NEW)
+		return 0;
+
+	/*
+	 * When the inode count goes down to zero, the
+	 * I_WILL_FREE and I_FREEING bits might get set.
+	 * But not before.
+	 *
+	 * So if we get this, we know those bits are
+	 * clear, and the inode is still interesting.
+	 */
+	if (atomic_inc_not_zero(&inode->i_count))
+		return 1;
+
+	/*
+	 * Slow path never happens normally, since any
+	 * active inode will be referenced by a dentry
+	 * and thus caught above
+	 */
+	spin_lock(&inode->i_lock);
+	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+	    (mapping->nrpages == 0)) {
+		spin_unlock(&inode->i_lock);
+		return 0;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	return 1;
+}
+
 static void wait_sb_inodes(struct super_block *sb)
 {
 	struct inode *inode, *old_inode = NULL;
@@ -1226,16 +1272,8 @@ static void wait_sb_inodes(struct super_block *sb)
 	 * we still have to wait for that writeout.
 	 */
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
-
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
-			spin_unlock(&inode->i_lock);
+		if (!get_inode_for_writeback(inode))
 			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_sb_list_lock);
 
 		/*
@@ -1249,7 +1287,7 @@ static void wait_sb_inodes(struct super_block *sb)
 		iput(old_inode);
 		old_inode = inode;
 
-		filemap_fdatawait(mapping);
+		filemap_fdatawait(inode->i_mapping);
 
 		cond_resched();
 

  reply	other threads:[~2013-06-28  5:59 UTC|newest]

Thread overview: 93+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-06-19 16:45 frequent softlockups with 3.10rc6 Dave Jones
2013-06-19 17:53 ` Dave Jones
2013-06-19 18:13   ` Paul E. McKenney
2013-06-19 18:42     ` Dave Jones
2013-06-20  0:12     ` Dave Jones
2013-06-20 16:16       ` Paul E. McKenney
2013-06-20 16:27         ` Dave Jones
2013-06-21 15:11         ` Dave Jones
2013-06-21 19:59           ` Oleg Nesterov
2013-06-22  1:37             ` Dave Jones
2013-06-22 17:31               ` Oleg Nesterov
2013-06-22 21:59                 ` Dave Jones
2013-06-23  5:00                   ` Andrew Vagin
2013-06-23 14:36                   ` Oleg Nesterov
2013-06-23 15:06                     ` Dave Jones
2013-06-23 16:04                       ` Oleg Nesterov
2013-06-24  0:21                         ` Dave Jones
2013-06-24  2:00                         ` Dave Jones
2013-06-24 14:39                           ` Oleg Nesterov
2013-06-24 14:52                             ` Steven Rostedt
2013-06-24 16:00                               ` Dave Jones
2013-06-24 16:24                                 ` Steven Rostedt
2013-06-24 16:51                                   ` Dave Jones
2013-06-24 17:04                                     ` Steven Rostedt
2013-06-25 16:55                                       ` Dave Jones
2013-06-25 17:21                                         ` Steven Rostedt
2013-06-25 17:23                                           ` Steven Rostedt
2013-06-25 17:26                                           ` Dave Jones
2013-06-25 17:31                                             ` Steven Rostedt
2013-06-25 17:32                                             ` Steven Rostedt
2013-06-25 17:29                                           ` Steven Rostedt
2013-06-25 17:34                                             ` Dave Jones
2013-06-24 16:37                                 ` Oleg Nesterov
2013-06-24 16:49                                   ` Dave Jones
2013-06-24 15:57                         ` Dave Jones
2013-06-24 17:35                           ` Oleg Nesterov
2013-06-24 17:44                             ` Dave Jones
2013-06-24 17:53                             ` Steven Rostedt
2013-06-24 18:00                               ` Dave Jones
2013-06-25 15:35                             ` Dave Jones
2013-06-25 16:23                               ` Steven Rostedt
2013-06-26  5:23                                 ` Dave Jones
2013-06-26 19:52                                   ` Steven Rostedt
2013-06-26 20:00                                     ` Dave Jones
2013-06-27  3:01                                       ` Steven Rostedt
2013-06-26  5:48                                 ` Dave Jones
2013-06-26 19:18                               ` Oleg Nesterov
2013-06-26 19:40                                 ` Dave Jones
2013-06-27  0:22                                 ` Dave Jones
2013-06-27  1:06                                   ` Eric W. Biederman
2013-06-27  2:32                                     ` Tejun Heo
2013-06-27  7:55                                   ` Dave Chinner
2013-06-27 10:06                                     ` Dave Chinner
2013-06-27 12:52                                       ` Dave Chinner
2013-06-27 15:21                                         ` Dave Jones
2013-06-28  1:13                                           ` Dave Chinner
2013-06-28  3:58                                             ` Dave Chinner
2013-06-28 10:28                                               ` Jan Kara
2013-06-29  3:39                                                 ` Dave Chinner
2013-07-01 12:00                                                   ` Jan Kara
2013-07-02  6:29                                                     ` Dave Chinner
2013-07-02  8:19                                                       ` Jan Kara
2013-07-02 12:38                                                         ` Dave Chinner
2013-07-02 14:05                                                           ` Jan Kara
2013-07-02 16:13                                                             ` Linus Torvalds
2013-07-02 16:57                                                               ` Jan Kara
2013-07-02 17:38                                                                 ` Linus Torvalds
2013-07-03  3:07                                                                   ` Dave Chinner
2013-07-03  3:28                                                                     ` Linus Torvalds
2013-07-03  4:49                                                                       ` Dave Chinner
2013-07-04  7:19                                                                         ` Andrew Morton
2013-06-29 20:13                                               ` Dave Jones
2013-06-29 22:23                                                 ` Linus Torvalds
2013-06-29 23:44                                                   ` Dave Jones
2013-06-30  0:21                                                     ` Steven Rostedt
2013-07-01 12:49                                                     ` Pavel Machek
2013-06-30  0:17                                                   ` Steven Rostedt
2013-06-30  2:05                                                   ` Dave Chinner
2013-06-30  2:34                                                     ` Dave Chinner
2013-06-27 14:30                                     ` Dave Jones
2013-06-28  1:18                                       ` Dave Chinner
2013-06-28  2:54                                         ` Linus Torvalds
2013-06-28  3:54                                           ` Dave Chinner
2013-06-28  5:59                                             ` Linus Torvalds [this message]
2013-06-28  7:21                                               ` Dave Chinner
2013-06-28  8:22                                                 ` Linus Torvalds
2013-06-28  8:32                                                   ` Al Viro
2013-06-28  8:22                                               ` Al Viro
2013-06-28  9:49                                               ` Jan Kara
2013-07-01 17:57                                             ` block layer softlockup Dave Jones
2013-07-02  2:07                                               ` Dave Chinner
2013-07-02  6:01                                                 ` Dave Jones
2013-07-02  7:30                                                   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+55aFyZYsbMpP+6dkdkhdDn9gpTx0dkv25MUtcnswer_a2x9w@mail.gmail.com \
    --to=torvalds@linux-foundation.org \
    --cc=avagin@openvz.org \
    --cc=davej@redhat.com \
    --cc=david@fromorbit.com \
    --cc=ebiederm@xmission.com \
    --cc=jack@suse.cz \
    --cc=linux-kernel@vger.kernel.org \
    --cc=oleg@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=rostedt@goodmis.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.