From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 18 Feb 2011 09:52:11 +0800
From: Feng Tang
To: Jan Kara
CC: "op.q.liu@gmail.com", "linux-kernel@vger.kernel.org",
	"Wu, Fengguang", "akpm@linux-foundation.org", "axboe@kernel.dk"
Subject: Re: ext2 write performance regression from 2.6.32
Message-ID: <20110218095211.556c22ed@feng-i7>
In-Reply-To: <20110217103214.GA4947@quack.suse.cz>
References: <20110215144641.05318556@feng-i7>
	<20110215111126.GD17313@quack.suse.cz>
	<20110216174031.183180c4@feng-i7>
	<20110216143530.GA5592@quack.suse.cz>
	<20110217140846.2196b756@feng-i7>
	<20110217103214.GA4947@quack.suse.cz>
Organization: intel
X-Mailer: Claws Mail 3.7.4 (GTK+ 2.20.0; i486-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Jan,

On Thu, 17 Feb 2011 18:32:14 +0800
Jan Kara wrote:

> On Thu 17-02-11 14:08:46, Feng Tang wrote:
> > On Wed, 16 Feb 2011 22:35:30 +0800
> > Jan Kara wrote:
> > > On Wed 16-02-11 17:40:31, Feng Tang wrote:
> > > > Hi,
> > > >
> > > > I made a debug patch which tries to delay the pure FS metadata
> > > > writeback (by at most 30 seconds, to match the current writeback
> > > > expire time). It works for me on 2.6.32, and the dd performance
> > > > is restored.
> > > >
> > > > Please help to review it, thanks!
> > > >
> > > > btw, I've sent out the block dump info requested by Jan Kara,
> > > > but didn't see it on LKML, so attached it again.
> > > >
> > > > - Feng
> > > >
> > > > From c35548c7d0c3a334d24c26adab277ef62b9825db Mon Sep 17 00:00:00 2001
> > > > From: Feng Tang
> > > > Date: Wed, 16 Feb 2011 17:27:36 +0800
> > > > Subject: [PATCH] writeback: delay the file system metadata
> > > >  writeback in 30 seconds
> > > >
> > > > Signed-off-by: Feng Tang
> > > > ---
> > > >  fs/fs-writeback.c |   10 ++++++++++
> > > >  1 files changed, 10 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index 9d5360c..418fd9e 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -635,6 +635,16 @@ static void writeback_inodes_wb(struct bdi_writeback *wb,
> > > >  			continue;
> > > >  		}
> > > >
> > > > +		if ((wbc->sync_mode != WB_SYNC_ALL)
> > > > +		    && !inode->i_ino
> > > > +		    && !strcmp(inode->i_sb->s_id, "bdev")) {
> > > > +			if (inode->dirtied_when + 30 * HZ > jiffies) {
> > > > +				list_move(&inode->i_list, &wb->b_dirty);
> > > > +				continue;
> > > > +			}
> > > > +		}
> > > > +
> > > > +
> > >   Doh, this is a crude hack! Nice for debugging, but no way to get
> > > this into the kernel. We have to find a cleaner way to speed up the
> > > writeback...
> >
> > I just tested your 5 writeback patches, and they don't fix the
> > problem; the FS metadata is still periodically written back every
> > one or two seconds. Attached is the block dump on 2.6.37 + your
> > patches.
>   Yes, I didn't expect the writeout of metadata to disappear, just
> the IO pattern to be better. So you didn't observe any change in
> throughput with my patches vs without them?

Your patches did bring some improvement, from about 7 MB/s to
7.7 MB/s, but they don't restore the original 10 MB/s. Kyle Liu, the
original reporter of this issue, saw similar results with your
patches. Anyway, I really hope the IO-less writeback patches, either
from you or from Fengguang, can be merged, as they should improve
writeback significantly.
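By the way, independent of whether the delay idea is viable at all, I
noticed my hack compares jiffies with a plain ">", which is not safe
across a jiffies wrap. If we keep experimenting in this direction, the
check should use the wrap-safe helper from <linux/jiffies.h>. A sketch
of the same logic (just a cleanup of my debug hack, not something to
merge):

	if (wbc->sync_mode != WB_SYNC_ALL
	    && !inode->i_ino
	    && !strcmp(inode->i_sb->s_id, "bdev")
	    && time_before(jiffies, inode->dirtied_when + 30 * HZ)) {
		/* metadata dirtied less than 30s ago: requeue it and
		 * keep writing ordinary data pages sequentially */
		list_move(&inode->i_list, &wb->b_dirty);
		continue;
	}

time_before(a, b) is true when a is before b even when jiffies has
wrapped, so this behaves the same as the original comparison but
without the overflow problem.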
> Looking at the block trace, we write about 8 MB of data before doing
> metadata writeback. That's about the best I'd expect with the current
> writeback settings, so things worked as expected.

Yes, it is.

> Hmm, but probably the flash card is simple and does not have any
> Flash Translation Layer, so each write of one metadata block costs
> us a rewrite of the whole erase block, which may well be in the MB
> range? That would explain it. Raising MAX_WRITEBACK_PAGES would help
> here, but given the throughput of the flash card, the fairness of
> writeback when there are more inodes to write would already be rather
> bad, so that's not a good solution either.
>
> > And yes, I agree my patch is kind of hacky, but its main point is
> > to delay the file system metadata writeback (by no longer than the
> > current writeback expiration limit of 30 seconds) to make the
> > normal data page writeback as sequential as possible. Could we keep
> > tuning it in this direction?
>   Apart from my aesthetic objections, the patch also brings some
> technical difficulties. For example, if lots of dirtying happens
> against the device inode (a metadata-heavy workload), we need to push
> out dirty pages from the device inode to clean dirty memory.
> Otherwise processes would just stall in balance_dirty_pages() for
> 30 s, waiting for pages to get cleaned. Another issue is that under
> heavier load the inode will get redirtied while you are writing
> metadata, so i_dirtied_when need not change. Finally, if the load is
> not as trivial as your single dd write, and there is also some other
> activity in the filesystem (like syslog writing to the device once in
> a while), then your large dd write will be mixed with small writes to
> other files anyway, and performance will degrade.
>
> That being said, I don't see an easy solution to your problem. The
> fact that 2.6.30 didn't write metadata was more a bug than a feature,
> but it happened to work well for your flash card. A solution that
> comes to my mind is that we could have a "write chunk size" parameter
> in each BDI (it would make sense not only for flash devices but also
> for RAID), and writeback would be aware that the written amount is
> always rounded up to "write chunk size". This ignores filesystem
> fragmentation, but that might be dealt with. And if we are doing well
> with cleaning pages, we may skip writes that have a small
> cleaned_pages / write_chunk_size ratio. But we'd have to be careful
> not to delay such "unprofitable" writes for too long. So it's not so
> easy to implement this.

Thanks for the detailed analysis; you are right that my patch can't
handle the heavy-metadata-dirtier case. I'll do some more checking.

Thanks,
Feng

> 							Honza
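P.S. Your erase-block theory also roughly matches our numbers. Here is
a back-of-the-envelope model; the erase block size and the number of
metadata blocks per writeback cycle below are pure assumptions on my
side, since I don't know the card's internals:

#include <stdio.h>

/*
 * Toy model of an FTL-less card where each small metadata write
 * costs a rewrite of a whole erase block. All sizes are assumed
 * illustrative numbers, not measurements.
 */
int main(void)
{
	double raw_mbps = 10.0;  /* assumed raw sequential speed */
	double data_mb  = 8.0;   /* data written per cycle, from the trace */
	double erase_mb = 2.0;   /* assumed erase block size */
	int meta_writes = 2;     /* assumed metadata blocks per cycle */

	double overhead_mb = meta_writes * erase_mb;
	double effective = raw_mbps * data_mb / (data_mb + overhead_mb);

	/* prints "effective throughput: 6.7 MB/s" -- close to the
	 * ~7 MB/s the dd test actually sees */
	printf("effective throughput: %.1f MB/s\n", effective);
	return 0;
}

With about 4 MB of forced erase-block rewrites per 8 MB of data, the
drop from 10 MB/s to roughly 7 MB/s comes out naturally.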