All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Dave Jones <davej@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Andrey Vagin <avagin@openvz.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	axboe@kernel.dk
Subject: Re: block layer softlockup
Date: Tue, 2 Jul 2013 12:07:41 +1000	[thread overview]
Message-ID: <20130702020741.GE4072@dastard> (raw)
In-Reply-To: <20130701175734.GA13641@redhat.com>

On Mon, Jul 01, 2013 at 01:57:34PM -0400, Dave Jones wrote:
> On Fri, Jun 28, 2013 at 01:54:37PM +1000, Dave Chinner wrote:
>  > On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote:
>  > > On Thu, Jun 27, 2013 at 3:18 PM, Dave Chinner <david@fromorbit.com> wrote:
>  > > >
>  > > > Right, that will be what is happening - the entire system will go
>  > > > unresponsive when a sync call happens, so it's entirely possible
>  > > > to see the soft lockups on inode_sb_list_add()/inode_sb_list_del()
>  > > > trying to get the lock because of the way ticket spinlocks work...
>  > > 
>  > > So what made it all start happening now? I don't recall us having had
>  > > these kinds of issues before..
>  > 
>  > Not sure - it's a sudden surprise for me, too. Then again, I haven't
>  > been looking at sync from a performance or lock contention point of
>  > view any time recently.  The algorithm that wait_sb_inodes() is
>  > effectively unchanged since at least 2009, so it's probably a case
>  > of it having been protected from contention by some external factor
>  > we've fixed/removed recently.  Perhaps the bdi-flusher thread
>  > replacement in -rc1 has changed the timing sufficiently that it no
>  > longer serialises concurrent sync calls as much....
> 
> This mornings new trace reminded me of this last sentence. Related ?

Was this running the last patch I posted, or a vanilla kernel?

> BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child1:7219]
....
> CPU: 0 PID: 7219 Comm: trinity-child1 Not tainted 3.10.0+ #38
.....
> RIP: 0010:[<ffffffff816ed037>]  [<ffffffff816ed037>] _raw_spin_unlock_irqrestore+0x67/0x80
.....
>  <IRQ> 
> 
>  [<ffffffff812da4c1>] blk_end_bidi_request+0x51/0x60
>  [<ffffffff812da4e0>] blk_end_request+0x10/0x20
>  [<ffffffff8149ba13>] scsi_io_completion+0xf3/0x6e0
>  [<ffffffff81491a60>] scsi_finish_command+0xb0/0x110
>  [<ffffffff8149b81f>] scsi_softirq_done+0x12f/0x160
>  [<ffffffff812e1e08>] blk_done_softirq+0x88/0xa0
>  [<ffffffff8105424f>] __do_softirq+0xff/0x440
>  [<ffffffff8105474d>] irq_exit+0xcd/0xe0
>  [<ffffffff816f760b>] smp_apic_timer_interrupt+0x6b/0x9b
>  [<ffffffff816f676f>] apic_timer_interrupt+0x6f/0x80
>  <EOI> 

That's doing IO completion processing in softirq time, and the lock
it just dropped was the q->queue_lock. But that lock is held over
end IO processing, so it is possible that the way the page writeback
transition handling of my POC patch caused this.

FWIW, I've attached a simple patch you might like to try to see if
it *minimises* the inode_sb_list_lock contention problems. All it
does is try to prevent concurrent entry in wait_sb_inodes() for a
given superblock and hence only have one walker on the contending
filesystem at a time. Replace the previous one I sent with it. If
that doesn't work, I have another simple patch that makes the
inode_sb_list_lock per-sb to take this isolation even further....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

sync: serialise per-superblock sync operations

From: Dave Chinner <dchinner@redhat.com>

When competing sync(2) calls walk the same filesystem, they need to
walk the list of inodes on the superblock to find all the inodes
that we need to wait for IO completion on. However, when multiple
wait_sb_inodes() calls do this at the same time, they contend on the
the inode_sb_list_lock and the contention causes system wide
slowdowns. In effect, concurrent sync(2) calls the take longer and
burn more CPU than if they were serialised.

Stop the worst of the contention by adding a per-sb mutex to wrap
around sync_inodes_sb() so that we only execute one sync(2)
operation at a time per superblock and hence mostly avoid
contention.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c  |    9 ++++++++-
 fs/super.c         |    1 +
 include/linux/fs.h |    2 ++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 996f91a..4d7a90c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1353,7 +1353,12 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
  * @sb: the superblock
  *
  * This function writes and waits on any dirty inode belonging to this
- * super_block.
+ * super_block. The @s_sync_lock is used to serialise concurrent sync operations
+ * to avoid lock contention problems with concurrent wait_sb_inodes() calls.
+ * This also allows us to optimise wait_sb_inodes() to use private dirty lists
+ * as subsequent sync calls will block waiting for @s_sync_lock and hence always
+ * wait for the inodes in the private sync lists to be completed before they do
+ * their own private wait.
  */
 void sync_inodes_sb(struct super_block *sb)
 {
@@ -1372,10 +1377,12 @@ void sync_inodes_sb(struct super_block *sb)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+	mutex_lock(&sb->s_sync_lock);
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
 
 	wait_sb_inodes(sb);
+	mutex_unlock(&sb->s_sync_lock);
 }
 EXPORT_SYMBOL(sync_inodes_sb);
 
diff --git a/fs/super.c b/fs/super.c
index 7465d43..887bfbe 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -181,6 +181,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
+		mutex_init(&s->s_sync_lock);
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		INIT_LIST_HEAD(&s->s_inode_lru);
 		spin_lock_init(&s->s_inode_lru_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 41f0945..74ba328 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1257,6 +1257,8 @@ struct super_block {
 	const struct xattr_handler **s_xattr;
 
 	struct list_head	s_inodes;	/* all inodes */
+	struct mutex		s_sync_lock;	/* sync serialisation lock */
+
 	struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 #ifdef CONFIG_SMP
 	struct list_head __percpu *s_files;

  reply	other threads:[~2013-07-02  2:07 UTC|newest]

Thread overview: 93+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-06-19 16:45 frequent softlockups with 3.10rc6 Dave Jones
2013-06-19 17:53 ` Dave Jones
2013-06-19 18:13   ` Paul E. McKenney
2013-06-19 18:42     ` Dave Jones
2013-06-20  0:12     ` Dave Jones
2013-06-20 16:16       ` Paul E. McKenney
2013-06-20 16:27         ` Dave Jones
2013-06-21 15:11         ` Dave Jones
2013-06-21 19:59           ` Oleg Nesterov
2013-06-22  1:37             ` Dave Jones
2013-06-22 17:31               ` Oleg Nesterov
2013-06-22 21:59                 ` Dave Jones
2013-06-23  5:00                   ` Andrew Vagin
2013-06-23 14:36                   ` Oleg Nesterov
2013-06-23 15:06                     ` Dave Jones
2013-06-23 16:04                       ` Oleg Nesterov
2013-06-24  0:21                         ` Dave Jones
2013-06-24  2:00                         ` Dave Jones
2013-06-24 14:39                           ` Oleg Nesterov
2013-06-24 14:52                             ` Steven Rostedt
2013-06-24 16:00                               ` Dave Jones
2013-06-24 16:24                                 ` Steven Rostedt
2013-06-24 16:51                                   ` Dave Jones
2013-06-24 17:04                                     ` Steven Rostedt
2013-06-25 16:55                                       ` Dave Jones
2013-06-25 17:21                                         ` Steven Rostedt
2013-06-25 17:23                                           ` Steven Rostedt
2013-06-25 17:26                                           ` Dave Jones
2013-06-25 17:31                                             ` Steven Rostedt
2013-06-25 17:32                                             ` Steven Rostedt
2013-06-25 17:29                                           ` Steven Rostedt
2013-06-25 17:34                                             ` Dave Jones
2013-06-24 16:37                                 ` Oleg Nesterov
2013-06-24 16:49                                   ` Dave Jones
2013-06-24 15:57                         ` Dave Jones
2013-06-24 17:35                           ` Oleg Nesterov
2013-06-24 17:44                             ` Dave Jones
2013-06-24 17:53                             ` Steven Rostedt
2013-06-24 18:00                               ` Dave Jones
2013-06-25 15:35                             ` Dave Jones
2013-06-25 16:23                               ` Steven Rostedt
2013-06-26  5:23                                 ` Dave Jones
2013-06-26 19:52                                   ` Steven Rostedt
2013-06-26 20:00                                     ` Dave Jones
2013-06-27  3:01                                       ` Steven Rostedt
2013-06-26  5:48                                 ` Dave Jones
2013-06-26 19:18                               ` Oleg Nesterov
2013-06-26 19:40                                 ` Dave Jones
2013-06-27  0:22                                 ` Dave Jones
2013-06-27  1:06                                   ` Eric W. Biederman
2013-06-27  2:32                                     ` Tejun Heo
2013-06-27  7:55                                   ` Dave Chinner
2013-06-27 10:06                                     ` Dave Chinner
2013-06-27 12:52                                       ` Dave Chinner
2013-06-27 15:21                                         ` Dave Jones
2013-06-28  1:13                                           ` Dave Chinner
2013-06-28  3:58                                             ` Dave Chinner
2013-06-28 10:28                                               ` Jan Kara
2013-06-29  3:39                                                 ` Dave Chinner
2013-07-01 12:00                                                   ` Jan Kara
2013-07-02  6:29                                                     ` Dave Chinner
2013-07-02  8:19                                                       ` Jan Kara
2013-07-02 12:38                                                         ` Dave Chinner
2013-07-02 14:05                                                           ` Jan Kara
2013-07-02 16:13                                                             ` Linus Torvalds
2013-07-02 16:57                                                               ` Jan Kara
2013-07-02 17:38                                                                 ` Linus Torvalds
2013-07-03  3:07                                                                   ` Dave Chinner
2013-07-03  3:28                                                                     ` Linus Torvalds
2013-07-03  4:49                                                                       ` Dave Chinner
2013-07-04  7:19                                                                         ` Andrew Morton
2013-06-29 20:13                                               ` Dave Jones
2013-06-29 22:23                                                 ` Linus Torvalds
2013-06-29 23:44                                                   ` Dave Jones
2013-06-30  0:21                                                     ` Steven Rostedt
2013-07-01 12:49                                                     ` Pavel Machek
2013-06-30  0:17                                                   ` Steven Rostedt
2013-06-30  2:05                                                   ` Dave Chinner
2013-06-30  2:34                                                     ` Dave Chinner
2013-06-27 14:30                                     ` Dave Jones
2013-06-28  1:18                                       ` Dave Chinner
2013-06-28  2:54                                         ` Linus Torvalds
2013-06-28  3:54                                           ` Dave Chinner
2013-06-28  5:59                                             ` Linus Torvalds
2013-06-28  7:21                                               ` Dave Chinner
2013-06-28  8:22                                                 ` Linus Torvalds
2013-06-28  8:32                                                   ` Al Viro
2013-06-28  8:22                                               ` Al Viro
2013-06-28  9:49                                               ` Jan Kara
2013-07-01 17:57                                             ` block layer softlockup Dave Jones
2013-07-02  2:07                                               ` Dave Chinner [this message]
2013-07-02  6:01                                                 ` Dave Jones
2013-07-02  7:30                                                   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130702020741.GE4072@dastard \
    --to=david@fromorbit.com \
    --cc=avagin@openvz.org \
    --cc=axboe@kernel.dk \
    --cc=davej@redhat.com \
    --cc=ebiederm@xmission.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=oleg@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=rostedt@goodmis.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.