All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>, Dave Jones <davej@redhat.com>,
	Oleg Nesterov <oleg@redhat.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Andrey Vagin <avagin@openvz.org>,
	Steven Rostedt <rostedt@goodmis.org>
Subject: Re: frequent softlockups with 3.10rc6.
Date: Tue, 2 Jul 2013 16:05:08 +0200	[thread overview]
Message-ID: <20130702140508.GB31770@quack.suse.cz> (raw)
In-Reply-To: <20130702123835.GF14996@dastard>

On Tue 02-07-13 22:38:35, Dave Chinner wrote:
> > As a bonus filesystems could also optimize their write_inode() methods when
> > they know ->sync_fs() is going to happen in future. E.g. ext4 wouldn't have
> > to do the stupid ext4_force_commit() after each written inode in
> > WB_SYNC_ALL mode.
> 
> Yeah, that's true.
> 
> Since XFS now catches all metadata modifications in it's journal, it
> doesn't have a ->write_inode method anymore.  Only ->fsync,
> ->sync_fs and ->commit_metadata are defined as data integrity
> operations that require metadata to be sychronised and we ensure the
> journal is committed in those methods. WB_SYNC_ALL writeback is
> really only a method for getting data dispatched to disk, so I
> suspect that we should ensure that waiting for data IO completion
> happens at higher levels, not be hidden deep in the guts of writing
> back inode metadata..
  Yeah. Ext4 could probably do the same, just noone took the time to audit
everything properly and remove the historical heritage... That being said
there are tricky things like making sure write_inode_now() from
iput_final() will do the right thing so it's not completely obvious.

> -- 
> Dave Chinner
> david@fromorbit.com
> 
> sync: don't block the flusher thread waiting on IO
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> When sync does it's WB_SYNC_ALL writeback, it issues data Io and
> then immediately waits for IO completion. This is done in the
> context of the flusher thread, and hence completely ties up the
> flusher thread for the backing device until all the dirty inodes
> have been synced. On filesystems that are dirtying inodes constantly
> and quickly, this means the flusher thread can be tied up for
> minutes per sync call and hence badly affect system level write IO
> performance as the page cache cannot be cleaned quickly.
> 
> We already have a wait loop for IO completion for sync(2), so cut
> this out of the flusher thread and delegate it to wait_sb_inodes().
> Hence we can do rapid IO submission, and then wait for it all to
> complete.
> 
> Effect of sync on fsmark before the patch:
> 
> FSUse%        Count         Size    Files/sec     App Overhead
> .....
>      0       640000         4096      35154.6          1026984
>      0       720000         4096      36740.3          1023844
>      0       800000         4096      36184.6           916599
>      0       880000         4096       1282.7          1054367
>      0       960000         4096       3951.3           918773
>      0      1040000         4096      40646.2           996448
>      0      1120000         4096      43610.1           895647
>      0      1200000         4096      40333.1           921048
> 
> And a single sync pass took:
> 
> real    0m52.407s
> user    0m0.000s
> sys     0m0.090s
> 
> After the patch, there is no impact on fsmark results, and each
> individual sync(2) operation run concurrently with the same fsmark
> workload takes roughly 7s:
> 
> real    0m6.930s
> user    0m0.000s
> sys     0m0.039s
> 
> IOWs, sync is 7-8x faster on a busy filesystem and does not have an
> adverse impact on ongoing async data write operations.
  The patch looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/fs-writeback.c         |    9 +++++++--
>  include/linux/writeback.h |    1 +
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 25a766c..ea56583 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -45,6 +45,7 @@ struct wb_writeback_work {
>  	unsigned int for_kupdate:1;
>  	unsigned int range_cyclic:1;
>  	unsigned int for_background:1;
> +	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
>  	enum wb_reason reason;		/* why was writeback initiated? */
>  
>  	struct list_head list;		/* pending work list */
> @@ -476,9 +477,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>  	/*
>  	 * Make sure to wait on the data before writing out the metadata.
>  	 * This is important for filesystems that modify metadata on data
> -	 * I/O completion.
> +	 * I/O completion. We don't do it for sync(2) writeback because it has a
> +	 * separate, external IO completion path and ->sync_fs for guaranteeing
> +	 * inode metadata is written back correctly.
>  	 */
> -	if (wbc->sync_mode == WB_SYNC_ALL) {
> +	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
>  		int err = filemap_fdatawait(mapping);
>  		if (ret == 0)
>  			ret = err;
> @@ -611,6 +614,7 @@ static long writeback_sb_inodes(struct super_block *sb,
>  		.tagged_writepages	= work->tagged_writepages,
>  		.for_kupdate		= work->for_kupdate,
>  		.for_background		= work->for_background,
> +		.for_sync		= work->for_sync,
>  		.range_cyclic		= work->range_cyclic,
>  		.range_start		= 0,
>  		.range_end		= LLONG_MAX,
> @@ -1442,6 +1446,7 @@ void sync_inodes_sb(struct super_block *sb)
>  		.range_cyclic	= 0,
>  		.done		= &done,
>  		.reason		= WB_REASON_SYNC,
> +		.for_sync	= 1,
>  	};
>  
>  	/* Nothing to do? */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 579a500..abfe117 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -78,6 +78,7 @@ struct writeback_control {
>  	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
> +	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
>  };
>  
>  /*
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

  reply	other threads:[~2013-07-02 14:05 UTC|newest]

Thread overview: 93+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-06-19 16:45 frequent softlockups with 3.10rc6 Dave Jones
2013-06-19 17:53 ` Dave Jones
2013-06-19 18:13   ` Paul E. McKenney
2013-06-19 18:42     ` Dave Jones
2013-06-20  0:12     ` Dave Jones
2013-06-20 16:16       ` Paul E. McKenney
2013-06-20 16:27         ` Dave Jones
2013-06-21 15:11         ` Dave Jones
2013-06-21 19:59           ` Oleg Nesterov
2013-06-22  1:37             ` Dave Jones
2013-06-22 17:31               ` Oleg Nesterov
2013-06-22 21:59                 ` Dave Jones
2013-06-23  5:00                   ` Andrew Vagin
2013-06-23 14:36                   ` Oleg Nesterov
2013-06-23 15:06                     ` Dave Jones
2013-06-23 16:04                       ` Oleg Nesterov
2013-06-24  0:21                         ` Dave Jones
2013-06-24  2:00                         ` Dave Jones
2013-06-24 14:39                           ` Oleg Nesterov
2013-06-24 14:52                             ` Steven Rostedt
2013-06-24 16:00                               ` Dave Jones
2013-06-24 16:24                                 ` Steven Rostedt
2013-06-24 16:51                                   ` Dave Jones
2013-06-24 17:04                                     ` Steven Rostedt
2013-06-25 16:55                                       ` Dave Jones
2013-06-25 17:21                                         ` Steven Rostedt
2013-06-25 17:23                                           ` Steven Rostedt
2013-06-25 17:26                                           ` Dave Jones
2013-06-25 17:31                                             ` Steven Rostedt
2013-06-25 17:32                                             ` Steven Rostedt
2013-06-25 17:29                                           ` Steven Rostedt
2013-06-25 17:34                                             ` Dave Jones
2013-06-24 16:37                                 ` Oleg Nesterov
2013-06-24 16:49                                   ` Dave Jones
2013-06-24 15:57                         ` Dave Jones
2013-06-24 17:35                           ` Oleg Nesterov
2013-06-24 17:44                             ` Dave Jones
2013-06-24 17:53                             ` Steven Rostedt
2013-06-24 18:00                               ` Dave Jones
2013-06-25 15:35                             ` Dave Jones
2013-06-25 16:23                               ` Steven Rostedt
2013-06-26  5:23                                 ` Dave Jones
2013-06-26 19:52                                   ` Steven Rostedt
2013-06-26 20:00                                     ` Dave Jones
2013-06-27  3:01                                       ` Steven Rostedt
2013-06-26  5:48                                 ` Dave Jones
2013-06-26 19:18                               ` Oleg Nesterov
2013-06-26 19:40                                 ` Dave Jones
2013-06-27  0:22                                 ` Dave Jones
2013-06-27  1:06                                   ` Eric W. Biederman
2013-06-27  2:32                                     ` Tejun Heo
2013-06-27  7:55                                   ` Dave Chinner
2013-06-27 10:06                                     ` Dave Chinner
2013-06-27 12:52                                       ` Dave Chinner
2013-06-27 15:21                                         ` Dave Jones
2013-06-28  1:13                                           ` Dave Chinner
2013-06-28  3:58                                             ` Dave Chinner
2013-06-28 10:28                                               ` Jan Kara
2013-06-29  3:39                                                 ` Dave Chinner
2013-07-01 12:00                                                   ` Jan Kara
2013-07-02  6:29                                                     ` Dave Chinner
2013-07-02  8:19                                                       ` Jan Kara
2013-07-02 12:38                                                         ` Dave Chinner
2013-07-02 14:05                                                           ` Jan Kara [this message]
2013-07-02 16:13                                                             ` Linus Torvalds
2013-07-02 16:57                                                               ` Jan Kara
2013-07-02 17:38                                                                 ` Linus Torvalds
2013-07-03  3:07                                                                   ` Dave Chinner
2013-07-03  3:28                                                                     ` Linus Torvalds
2013-07-03  4:49                                                                       ` Dave Chinner
2013-07-04  7:19                                                                         ` Andrew Morton
2013-06-29 20:13                                               ` Dave Jones
2013-06-29 22:23                                                 ` Linus Torvalds
2013-06-29 23:44                                                   ` Dave Jones
2013-06-30  0:21                                                     ` Steven Rostedt
2013-07-01 12:49                                                     ` Pavel Machek
2013-06-30  0:17                                                   ` Steven Rostedt
2013-06-30  2:05                                                   ` Dave Chinner
2013-06-30  2:34                                                     ` Dave Chinner
2013-06-27 14:30                                     ` Dave Jones
2013-06-28  1:18                                       ` Dave Chinner
2013-06-28  2:54                                         ` Linus Torvalds
2013-06-28  3:54                                           ` Dave Chinner
2013-06-28  5:59                                             ` Linus Torvalds
2013-06-28  7:21                                               ` Dave Chinner
2013-06-28  8:22                                                 ` Linus Torvalds
2013-06-28  8:32                                                   ` Al Viro
2013-06-28  8:22                                               ` Al Viro
2013-06-28  9:49                                               ` Jan Kara
2013-07-01 17:57                                             ` block layer softlockup Dave Jones
2013-07-02  2:07                                               ` Dave Chinner
2013-07-02  6:01                                                 ` Dave Jones
2013-07-02  7:30                                                   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130702140508.GB31770@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=avagin@openvz.org \
    --cc=davej@redhat.com \
    --cc=david@fromorbit.com \
    --cc=ebiederm@xmission.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=oleg@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=rostedt@goodmis.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.