Date: Sat, 3 Oct 2009 14:10:44 +0800
From: Wu Fengguang
To: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
	Andrew Morton, Peter Zijlstra, "Li, Shaohua",
	"linux-kernel@vger.kernel.org", "richard@rsk.demon.co.uk",
	"jens.axboe@oracle.com"
Subject: Re: regression in page writeback
Message-ID: <20091003061044.GA3791@localhost>
In-Reply-To: <20091002172620.GB8161@mit.edu>
References: <20090928071507.GA20068@localhost>
 <20090928130804.GA25880@infradead.org>
 <20090928140756.GC17514@mit.edu>
 <20090930052657.GA17268@localhost>
 <20090930141158.GG24383@mit.edu>
 <20091001151429.GB9469@localhost>
 <20091001215438.GY24383@mit.edu>
 <20091002025502.GA14246@localhost>
 <20091002081953.GA14529@localhost>
 <20091002172620.GB8161@mit.edu>

On Sat, Oct 03, 2009 at 01:26:20AM +0800, Theodore Ts'o wrote:
> On Fri, Oct 02, 2009 at 04:19:53PM +0800, Wu Fengguang wrote:
> > > > The big writes, if they are contiguous, could take 1-2 seconds
> > > > on a very slow, ancient laptop disk, and that will hold up any
> > > > kind of small synchronous activity --- such as a disk read or a
> > > > firefox-triggered fsync().
> > >
> > > Yes, that's a problem. The SYNC/ASYNC elevator queues can help here.
>
> The SYNC/ASYNC queues will only partially help, up to the largest I/O
> that can be issued as a single chunk, times the queue depth for those
> disks that support NCQ.
>
> > > There's still the problem of IO submission time != IO completion
> > > time, due to fluctuations of randomness and more. However, that's
> > > a general and unavoidable problem. Both the wbc.timeout scheme and
> > > the "wbc.nr_to_write based on estimated throughput" scheme are
> > > based on _past_ requests, and it's simply impossible to have a
> > > 100% accurate scheme. In principle, wbc.timeout will only be
> > > inferior at IO startup time. In the steady state of a 100% full
> > > queue, it is actually estimating the IO throughput implicitly :)
> >
> > Another difference between wbc.timeout and adaptive wbc.nr_to_write
> > is that when many _read_ requests or fsyncs come in, these SYNC rw
> > requests will significantly lower the ASYNC writeback throughput, if
> > it is not stalled completely. So with timeout, the inode will be
> > aborted with few pages written; with nr_to_write, the inode will be
> > written a good number of pages, at the cost of taking a long time.
> >
> > IMHO the nr_to_write behavior seems more efficient. What do you think?
> I agree, adaptively changing nr_to_write seems like the right thing to

I'd like to estimate the writeback throughput in bdi_writeback_wakeup(),
where the queue is not starved and the estimate would therefore reflect
the max device capability (unless there are busy reads, in which case
we need a lower nr_to_write anyway).

> do.  For bonus points, we could also monitor how often synchronous I/O
> operations are happening, and allow nr_to_write to go up by some
> amount if there aren't many synchronous operations happening at the
> moment. So that might be another opportunity to do auto-tuning,
> although this might be a heuristic that needs to be configurable for
> certain specialized workloads. For many other workloads, it should be
> possible to detect a regular pattern of reads and/or synchronous
> writes, and if so, use a lower nr_to_write than when there aren't many
> synchronous I/O operations happening on that particular block device.

It's not easy to get an accurate, up-to-the-moment picture of SYNC
read/write busyness. However, it is possible to "feel" it through the
progress of the ASYNC writes:

- set up a per-file timeout = 3*HZ
- check it in write_cache_pages:

        if (half of nr_to_write pages written && timeout expired)
                break;

In this way we back off to nr_to_write/2 if the writeback is blocked
by some busy READs.

I'd choose to implement this advanced feature some time later :)

Thanks,
Fengguang
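---
Purely illustrative, untested sketch of the "back off at nr_to_write/2
after a timeout" check described above. The wbc fields "writeback_start"
and "timeout" are hypothetical additions used only for illustration;
they are not part of the current struct writeback_control.

#include <linux/jiffies.h>
#include <linux/writeback.h>

/*
 * Return true when write_cache_pages() should stop writing this file
 * early: at least half of the nr_to_write quota has been written AND
 * the per-file time budget (e.g. 3*HZ) has expired, which suggests the
 * ASYNC writeback is being slowed down by busy SYNC reads.
 */
static bool wbc_should_back_off(struct writeback_control *wbc,
				long nr_written, long nr_to_write)
{
	if (nr_written < nr_to_write / 2)
		return false;

	return time_after(jiffies, wbc->writeback_start + wbc->timeout);
}

write_cache_pages() would check something like this after each page is
written and break out of its loop when it returns true, so that a file
stuck behind busy reads only gets nr_to_write/2 pages before writeback
moves on to the next inode.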