From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Dave Jones <davej@redhat.com>, Oleg Nesterov <oleg@redhat.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Andrey Vagin <avagin@openvz.org>,
	Steven Rostedt <rostedt@goodmis.org>
Subject: Re: frequent softlockups with 3.10rc6.
Date: Fri, 28 Jun 2013 12:28:19 +0200	[thread overview]
Message-ID: <20130628102819.GA4725@quack.suse.cz> (raw)
In-Reply-To: <20130628035825.GC29338@dastard>

On Fri 28-06-13 13:58:25, Dave Chinner wrote:
> On Fri, Jun 28, 2013 at 11:13:01AM +1000, Dave Chinner wrote:
> > On Thu, Jun 27, 2013 at 11:21:51AM -0400, Dave Jones wrote:
> > > On Thu, Jun 27, 2013 at 10:52:18PM +1000, Dave Chinner wrote:
> > >  
> > >  
> > >  > > Yup, that's about three orders of magnitude faster on this
> > >  > > workload....
> > >  > > 
> > >  > > Lightly smoke tested patch below - it passed the first round of
> > >  > > XFS data integrity tests in xfstests, so it's not completely
> > >  > > busted...
> > >  > 
> > >  > And now with even less smoke than the first version. This one
> > >  > gets through a full xfstests run...
> > > 
> > > :sadface:
> > > 
> > > [  567.680836] ======================================================
> > > [  567.681582] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
> > > [  567.682389] 3.10.0-rc7+ #9 Not tainted
> > > [  567.682862] ------------------------------------------------------
> > > [  567.683607] trinity-child2/8665 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > > [  567.684464]  (&sb->s_type->i_lock_key#3){+.+...}, at: [<ffffffff811d74e5>] sync_inodes_sb+0x225/0x3b0
> > > [  567.685632] 
> > > and this task is already holding:
> > > [  567.686334]  (&(&wb->wb_list_lock)->rlock){..-...}, at: [<ffffffff811d7451>] sync_inodes_sb+0x191/0x3b0
> > > [  567.687506] which would create a new lock dependency:
> > > [  567.688115]  (&(&wb->wb_list_lock)->rlock){..-...} -> (&sb->s_type->i_lock_key#3){+.+...}
> > 
> > .....
> > 
> > > other info that might help us debug this:
> > > 
> > > [  567.750396]  Possible interrupt unsafe locking scenario:
> > > 
> > > [  567.752062]        CPU0                    CPU1
> > > [  567.753025]        ----                    ----
> > > [  567.753981]   lock(&sb->s_type->i_lock_key#3);
> > > [  567.754969]                                local_irq_disable();
> > > [  567.756085]                                lock(&(&wb->wb_list_lock)->rlock);
> > > [  567.757368]                                lock(&sb->s_type->i_lock_key#3);
> > > [  567.758642]   <Interrupt>
> > > [  567.759370]     lock(&(&wb->wb_list_lock)->rlock);
> > 
> > Oh, that's easy enough to fix. It's just changing the wait_sb_inodes
> > loop to use a spin_trylock(&inode->i_lock), moving the inode to
> > the end of the sync list, dropping all locks and starting again...
> 
> New version below, went through xfstests with lockdep enabled this
> time....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> writeback: store inodes under writeback on a separate list
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> When there are lots of cached inodes, a sync(2) operation walks all
> of them to try to find which ones are under writeback and wait for
> IO completion on them. Run enough load, and this caused catastrophic
> lock contention on the inode_sb_list_lock.
> 
> Try to fix this problem by tracking inodes under data IO and waiting
> specifically for only those inodes that haven't completed their data
> IO in wait_sb_inodes().
> 
> This is a bit hacky and messy, but demonstrates one method of
> solving this problem....
> 
> XXX: This will catch data IO - do we need to catch actual inode
> writeback (i.e. the metadata) here? I'm pretty sure we don't need to
> because the existing code just calls filemap_fdatawait() and that
> doesn't wait for the inode metadata writeback to occur....
> 
> [v3 - avoid deadlock due to interrupt while holding inode->i_lock]
> 
> [v2 - needs spin_lock_irq variants in wait_sb_inodes.
>     - move freeing inodes back to primary list, we don't wait for
>       them
>     - take mapping lock in wait_sb_inodes when requeuing.]
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/fs-writeback.c           |   70 +++++++++++++++++++++++++++++++++++++------
>  fs/inode.c                  |    1 +
>  include/linux/backing-dev.h |    3 ++
>  include/linux/fs.h          |    3 +-
>  mm/backing-dev.c            |    2 ++
>  mm/page-writeback.c         |   22 ++++++++++++++
>  6 files changed, 91 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 3be5718..589c40b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1208,7 +1208,10 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>  
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> +	struct backing_dev_info *bdi = sb->s_bdi;
> +	struct inode *old_inode = NULL;
> +	unsigned long flags;
> +	LIST_HEAD(sync_list);
>  
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -1216,7 +1219,6 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -	spin_lock(&inode_sb_list_lock);
>  
>  	/*
>  	 * Data integrity sync. Must wait for all pages under writeback,
> @@ -1224,19 +1226,58 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 * call, but which had writeout started before we write it out.
>  	 * In which case, the inode may not be on the dirty list, but
>  	 * we still have to wait for that writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them.
> +	 *
> +	 * Inodes that have pages under writeback after we've finished the wait
> +	 * may or may not be on the primary list. They had pages put under
> +	 * IO after we started our wait, so we need to make sure the next
> +	 * sync or IO completion treats them correctly. Move them back to
> +	 * the primary list and restart the walk.
> +	 *
> +	 * Inodes that are clean after we have waited for them don't belong
> +	 * on any list, and the cleaning of them should have removed them from
> +	 * the temporary list. Check this is true, and restart the walk.
>  	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
> +	list_splice_init(&bdi->wb.b_wb, &sync_list);
> +
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_io_list);
>  		struct address_space *mapping = inode->i_mapping;
>  
> -		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> +		/*
> +		 * We are nesting the inode->i_lock inside an IRQ-disabled
> +		 * section here, so there's the possibility that we could have
> +		 * a lock inversion due to an interrupt while holding the
> +		 * inode->i_lock elsewhere. This is the only place we take the
> +		 * inode->i_lock inside the wb_list_lock, so we need to use a
> +		 * trylock to avoid a deadlock. If we fail to get the lock,
> +		 * the only way to make progress is to also drop the
> +		 * wb_list_lock so the interrupt trying to get it can make
> +		 * progress.
> +		 */
> +		if (!spin_trylock(&inode->i_lock)) {
> +			list_move(&inode->i_io_list, &bdi->wb.b_wb);
> +			spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
> +			cpu_relax();
> +			spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
> +			continue;
> +		}
> +
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +			list_move(&inode->i_io_list, &bdi->wb.b_wb);
>  			spin_unlock(&inode->i_lock);
>  			continue;
>  		}
Ugh, the locking looks ugly. Plus the list handling is buggy because the
first wait_sb_inodes() invocation will move all inodes to its private
sync_list, so if another wait_sb_inodes() invocation races with it, that
second invocation won't properly wait for all the inodes it should.

Won't it be easier to remove inodes from the b_wb list (btw, I'd slightly
prefer the name b_writeback) lazily instead of from
test_clear_page_writeback()? I mean we would remove inodes from the b_wb
list only in wait_sb_inodes() or when inodes get reclaimed from memory.
That way we save some work in test_clear_page_writeback(), which is a fast
path, and defer it to sync, which isn't that performance critical. Also we
would avoid those ugly games with IRQ-safe spinlocks.

>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&inode_sb_list_lock);
> +		spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
>  
>  		/*
>  		 * We hold a reference to 'inode' so it couldn't have been
> @@ -1253,9 +1294,20 @@ static void wait_sb_inodes(struct super_block *sb)
>  
>  		cond_resched();
>  
> -		spin_lock(&inode_sb_list_lock);
> +		/*
> +		 * the inode has been written back now, so check whether we
> +		 * still have pages under IO and move it back to the primary
> +		 * list if necessary
> +		 */
> +		spin_lock_irqsave(&mapping->tree_lock, flags);
> +		spin_lock(&bdi->wb.wb_list_lock);
> +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			WARN_ON(list_empty(&inode->i_io_list));
> +			list_move(&inode->i_io_list, &bdi->wb.b_wb);
> +		}
> +		spin_unlock(&mapping->tree_lock);
>  	}
> -	spin_unlock(&inode_sb_list_lock);
> +	spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
>  	iput(old_inode);
>  }
>  
> diff --git a/fs/inode.c b/fs/inode.c
> index 00d5fc3..f25c1ca 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -364,6 +364,7 @@ void inode_init_once(struct inode *inode)
>  	INIT_HLIST_NODE(&inode->i_hash);
>  	INIT_LIST_HEAD(&inode->i_devices);
>  	INIT_LIST_HEAD(&inode->i_wb_list);
> +	INIT_LIST_HEAD(&inode->i_io_list);
>  	INIT_LIST_HEAD(&inode->i_lru);
>  	address_space_init_once(&inode->i_data);
>  	i_size_ordered_init(inode);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c388155..4a6283c 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -59,6 +59,9 @@ struct bdi_writeback {
>  	struct list_head b_io;		/* parked for writeback */
>  	struct list_head b_more_io;	/* parked for more writeback */
>  	spinlock_t list_lock;		/* protects the b_* lists */
> +
> +	spinlock_t wb_list_lock;	/* writeback list lock */
> +	struct list_head b_wb;		/* under writeback, for wait_sb_inodes */
>  };
>  
>  struct backing_dev_info {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 63cac31..7861017 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -573,7 +573,8 @@ struct inode {
>  	unsigned long		dirtied_when;	/* jiffies of first dirtying */
>  
>  	struct hlist_node	i_hash;
> -	struct list_head	i_wb_list;	/* backing dev IO list */
> +	struct list_head	i_wb_list;	/* backing dev WB list */
> +	struct list_head	i_io_list;	/* backing dev IO list */
>  	struct list_head	i_lru;		/* inode LRU list */
>  	struct list_head	i_sb_list;
>  	union {
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 5025174..896b8f5 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -421,7 +421,9 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
>  	INIT_LIST_HEAD(&wb->b_dirty);
>  	INIT_LIST_HEAD(&wb->b_io);
>  	INIT_LIST_HEAD(&wb->b_more_io);
> +	INIT_LIST_HEAD(&wb->b_wb);
>  	spin_lock_init(&wb->list_lock);
> +	spin_lock_init(&wb->wb_list_lock);
>  	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
>  }
>  
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 4514ad7..4c411fe 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2238,6 +2238,15 @@ int test_clear_page_writeback(struct page *page)
>  				__dec_bdi_stat(bdi, BDI_WRITEBACK);
>  				__bdi_writeout_inc(bdi);
>  			}
> +			if (mapping->host &&
> +			    !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +				struct inode *inode = mapping->host;
> +
> +				WARN_ON(list_empty(&inode->i_io_list));
> +				spin_lock(&bdi->wb.wb_list_lock);
> +				list_del_init(&inode->i_io_list);
> +				spin_unlock(&bdi->wb.wb_list_lock);
> +			}
>  		}
>  		spin_unlock_irqrestore(&mapping->tree_lock, flags);
>  	} else {
> @@ -2262,11 +2271,24 @@ int test_set_page_writeback(struct page *page)
>  		spin_lock_irqsave(&mapping->tree_lock, flags);
>  		ret = TestSetPageWriteback(page);
>  		if (!ret) {
> +			bool on_wblist;
> +
> +			/* __swap_writepage comes through here */
> +			on_wblist = mapping_tagged(mapping,
> +						   PAGECACHE_TAG_WRITEBACK);
>  			radix_tree_tag_set(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_WRITEBACK);
>  			if (bdi_cap_account_writeback(bdi))
>  				__inc_bdi_stat(bdi, BDI_WRITEBACK);
> +			if (!on_wblist && mapping->host) {
> +				struct inode *inode = mapping->host;
> +
> +				WARN_ON(!list_empty(&inode->i_io_list));
> +				spin_lock(&bdi->wb.wb_list_lock);
> +				list_add_tail(&inode->i_io_list, &bdi->wb.b_wb);
> +				spin_unlock(&bdi->wb.wb_list_lock);
> +			}
>  		}
>  		if (!PageDirty(page))
>  			radix_tree_tag_clear(&mapping->page_tree,
I'm somewhat uneasy about this. Writeback code generally uses the
inode_to_bdi() function to get from the mapping to the backing_dev_info
(which uses inode->i_sb->s_bdi except for inodes on blockdev_superblock).
That isn't always the same as the inode->i_mapping->backing_dev_info used
here, although I currently fail to remember a case where
inode->i_mapping->backing_dev_info would be the wrong bdi to use for sync
purposes.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
