All of lore.kernel.org
 help / color / mirror / Atom feed
From: Feng Tang <feng.tang@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: "op.q.liu@gmail.com" <op.q.liu@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Wu, Fengguang" <fengguang.wu@intel.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"axboe@kernel.dk" <axboe@kernel.dk>
Subject: Re: ext2 write performance regression from 2.6.32
Date: Fri, 18 Feb 2011 09:52:11 +0800	[thread overview]
Message-ID: <20110218095211.556c22ed@feng-i7> (raw)
In-Reply-To: <20110217103214.GA4947@quack.suse.cz>

Hi Jan,

On Thu, 17 Feb 2011 18:32:14 +0800
Jan Kara <jack@suse.cz> wrote:

> On Thu 17-02-11 14:08:46, Feng Tang wrote:
> > On Wed, 16 Feb 2011 22:35:30 +0800
> > Jan Kara <jack@suse.cz> wrote:
> > > On Wed 16-02-11 17:40:31, Feng Tang wrote:
> > > > Hi,
> > > > 
> > > > I made out a debug patch which try to delay the pure FS metadata
> > > > writeback (maxim 30 seconds to match current writeback expire
> > > > time). It works for me on 2.6.32, and the dd performance is
> > > > restored.
> > > > 
> > > > Please help to review it, thanks!
> > > > 
> > > > btw, I've sent out the block dump info requested by Jan Kara,
> > > > but didn't see it on LKML, so attached them again.
> > > > 
> > > > - Feng
> > > > 
> > > > From c35548c7d0c3a334d24c26adab277ef62b9825db Mon Sep 17
> > > > 00:00:00 2001 From: Feng Tang <feng.tang@intel.com>
> > > > Date: Wed, 16 Feb 2011 17:27:36 +0800
> > > > Subject: [PATCH] writeback: delay the file system metadata
> > > > writeback in 30 seconds
> > > > 
> > > > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > > > ---
> > > >  fs/fs-writeback.c |   10 ++++++++++
> > > >  1 files changed, 10 insertions(+), 0 deletions(-)
> > > > 
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index 9d5360c..418fd9e 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -635,6 +635,16 @@ static void writeback_inodes_wb(struct
> > > > bdi_writeback *wb, continue;
> > > >  		}
> > > >  
> > > > +		if ((wbc->sync_mode != WB_SYNC_ALL)
> > > > +			&& !inode->i_ino
> > > > +			&& !strcmp(inode->i_sb->s_id, "bdev"))
> > > > {
> > > > +			if (inode->dirtied_when + 30 * HZ >
> > > > jiffies) {
> > > > +				list_move(&inode->i_list,
> > > > &wb->b_dirty);
> > > > +				continue;
> > > > +			} 
> > > > +		}
> > > > +
> > > > +
> > >   Doh, this is a crude hack! Nice for debugging but no way to get
> > > this into the kernel. We have to find a cleaner way to speedup the
> > > writeback...
> > 
> > I just tested you 5 writeback patches, and they don't fix the
> > problem, the FS metadata are still periodically written back every
> > one or two seconds. Attached the block dump on 2.6.37 + your
> > patches.
> Yes, so I didn't expect the writeout of metadata will disappear, just
> the IO pattern should be better. So you didn't observe any change in
> throughput with my patches vs without them?

With your patches, it did bring some improvements, like from 7 MB/s to
7.7 MB/s, but not restore back to 10 MB/s. Kyle Liu, who is the original
reporter of this issue also see similar results for your patch.

Anyway, I really hope the IO-less patch either from you or Fengguang
can be merged, which will significantly improve the writeback.

> 
> Looking at the block trace, we write about 8 MB of data before doing
> metadata writeback. That's about the best what I'd expect with current
> writeback settings so things worked as expected.

Yes, it is

> 
> Hmm, but probably flash card is simple and does not have any
> Flash Translation Layer and so each write of one metadata block costs
> us as a rewrite of the whole erase block which may well be in a MB
> range? That would explain it. Raising MAX_WRITEBACK_PAGES would help
> here but given the throughput of the flash card, the fairness of
> writeback when there are more inodes to write would be already rather
> bad so that's not a good solution either.
>  
> > And ye, I agree my patch is kinds of hacky, but its main point is
> > to delay the file system metadata writeback (no longer than current
> > writeback expiration limit: 30 seconds) to make the normal data
> > pages writeback as sequential as possible. Could we go on tuning it
> > in this direction?
> Apart from my aesthetical feelings, the patch brings also some
> technical difficulties. For example if lots of dirtying happens
> against device inode (metadata heavy workload), we need to push out
> dirty pages from the device inode to clean dirty memory. Otherwise
> the processes would just stall in balance_dirty_pages() for 30 s
> waiting for pages to get cleaned. Another issue is that under heavier
> load, the inode will get redirtied while you are writing metadata.
> Thus i_dirtied_when need not change.  Finally, if the load is not so
> trivial like your single dd write, but there happens also some other
> activity in the filesystem (like syslog writing to the device once in
> a while or so), then your large dd write will be mixed with small
> writes to other files anyway and thus performance will degrade.
> 
> That being said I don't see an easy solution to your problem. The
> fact that 2.6.30 didn't write metadata was more a bug than a feature
> but happened to work good for your flash card. A solution that comes
> to my mind is that we could have a "write chunk size" parameter in
> each BDI (it would make sense not only for flash devices but also for
> RAID) and writeback would be aware that written amount is always
> rounded up to "write chunk size". This ignores filesystem
> fragmentation but that might be dealt with. And if we are doing well
> with cleaning pages, we may skip writes that have small
> cleaned_pages / write_chunk_size ratio. But we'd have to be careful
> not to delay such "inprofitable" writes for too long. So it's not so
> easy to implement this.
> 

Thanks for the detailed analysis, that patch can't handle heavy metadata
dirtier case. I'll try to do more check

Thanks,
Feng

> 									Honza

  reply	other threads:[~2011-02-18  1:52 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-01-28  7:15 ext2 write performance regression from 2.6.32 Kyle liu
     [not found] ` <AANLkTikvpyTPVnP1cxC4rSLARO3thpscyhmB4=BpFW-G@mail.gmail.com>
2011-02-15  6:46   ` Feng Tang
2011-02-15 11:11     ` Jan Kara
     [not found]       ` <AANLkTikGdud4FX0TcC-Sf_-_V-i8doZ73m63B=JA4kWp@mail.gmail.com>
     [not found]         ` <20110216174031.183180c4@feng-i7>
2011-02-16 11:03           ` Kyle liu
2011-02-16 14:35           ` Jan Kara
     [not found]             ` <20110217140846.2196b756@feng-i7>
2011-02-17 10:32               ` Jan Kara
2011-02-18  1:52                 ` Feng Tang [this message]
     [not found]       ` <20110216102055.48af0d85@feng-i7>
2011-02-16 15:40         ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110218095211.556c22ed@feng-i7 \
    --to=feng.tang@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=fengguang.wu@intel.com \
    --cc=jack@suse.cz \
    --cc=linux-kernel@vger.kernel.org \
    --cc=op.q.liu@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.