Hi Jan, On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote: > Hello Fengguang, > > this patch is much easier to read than in older versions! Good work! Thank you :) > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi, > > + unsigned long thresh, > > + unsigned long bg_thresh, > > + unsigned long dirty, > > + unsigned long bdi_thresh, > > + unsigned long bdi_dirty) > > +{ > > + unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh); > > + unsigned long limit = hard_dirty_limit(thresh); > > + unsigned long x_intercept; > > + unsigned long setpoint; /* the target balance point */ > > + unsigned long span; > > + long long pos_ratio; /* for scaling up/down the rate limit */ > > + long x; > > + > > + if (unlikely(dirty >= limit)) > > + return 0; > > + > > + /* > > + * global setpoint > > + * > > + * setpoint - dirty 3 > > + * f(dirty) := 1 + (----------------) > > + * limit - setpoint > > + * > > + * it's a 3rd order polynomial that subjects to > > + * > > + * (1) f(freerun) = 2.0 => rampup base_rate reasonably fast > > + * (2) f(setpoint) = 1.0 => the balance point > > + * (3) f(limit) = 0 => the hard limit > > + * (4) df/dx < 0 => negative feedback control > > + * (5) the closer to setpoint, the smaller |df/dx| (and the reverse) > > + * => fast response on large errors; small oscillation near setpoint > > + */ > > + setpoint = (freerun + limit) / 2; > > + x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT, > > + limit - setpoint + 1); > > + pos_ratio = x; > > + pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; > > + pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; > > + pos_ratio += 1 << RATELIMIT_CALC_SHIFT; > > + > > + /* > > + * bdi setpoint > > + * > > + * f(dirty) := 1.0 + k * (dirty - setpoint) > > + * > > + * The main bdi control line is a linear function that subjects to > > + * > > + * (1) f(setpoint) = 1.0 > > + * (2) k = - 1 / (8 * write_bw) (in single bdi case) > > + * or equally: x_intercept = setpoint + 8 * write_bw > > + * > > + * For single bdi case, the dirty pages are observed to fluctuate > > + * regularly within range > > + * [setpoint - write_bw/2, setpoint + write_bw/2] > > + * for various filesystems, where (2) can yield in a reasonable 12.5% > > + * fluctuation range for pos_ratio. > > + * > > + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its > > + * own size, so move the slope over accordingly. > > + */ > > + if (unlikely(bdi_thresh > thresh)) > > + bdi_thresh = thresh; > > + /* > > + * scale global setpoint to bdi's: setpoint *= bdi_thresh / thresh > > + */ > > + x = div_u64((u64)bdi_thresh << 16, thresh | 1); > > + setpoint = setpoint * (u64)x >> 16; > > + /* > > + * Use span=(4*write_bw) in single bdi case as indicated by > > + * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case. > > + */ > > + span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) + > > + (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh, > > + thresh + 1); > I think you can slightly simplify this to: > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16; Good idea! > > + x_intercept = setpoint + 2 * span; > What if x_intercept > bdi_thresh? Since 8*bdi->avg_write_bandwidth is > easily 500 MB, that happens quite often I imagine? That's fine because I no longer target "bdi_thresh" as some limiting factor as the global "thresh". Due to it being unstable in small memory JBOD systems, which is the big and unique problem in JBOD. > > + > > + if (unlikely(bdi_dirty > setpoint + span)) { > > + if (unlikely(bdi_dirty > limit)) > > + return 0; > Shouldn't this be bdi_thresh instead of limit? I understand this is a > hard limit but with more bdis this condition is rather weak and almost > never true. Yeah, I mean @limit. @bdi_thresh is made weak in IO-less balance_dirty_pages() in order to get reasonable smooth dirty rate in the face of a fluctuating @bdi_thresh. The tradeoff is to let bdi dirty pages fluctuate more or less freely, as long as they don't drop low and risk IO queue underflow. The attached patch tries to prevent the underflow (which is good but not perfect). > > + if (x_intercept < limit) { > > + x_intercept = limit; /* auxiliary control line */ > > + setpoint += span; > > + pos_ratio >>= 1; > > + } > And here you stretch the control area upto the global dirty limit. I > understand you maybe don't want to be really strict and cut control area at > bdi_thresh but your choice looks like too benevolent - when you have > several active bdi's with different speeds this will effectively erase > difference between them, won't it? E.g. with 10 bdi's (x_intercept - > bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless > bdi_dirty really heavily exceeds bdi_thresh. Yes the auxiliary control line could be very flat (small slope). However it's not normal for the bdi dirty pages to fall into the range of auxiliary control line at all. And once it takes effect, the pos_ratio is at most 0.5 (which is the value at the connection point with the main bdi control line) which is strong enough to pull the dirty pages off the auxiliary bdi control line and into the scope of main bdi control line. The auxiliary control line is intended for bringing down the bdi_dirty of the USB key before 250s (where the "pos bandwidth" line keeps low): http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/ext4-4dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-46/balance_dirty_pages-pages.png After that the bdi_dirty will fluctuate around bdi_thresh and won't grow high and step into the scope of the auxiliary control line. > So wouldn't it be better to > just make sure control area is reasonably large (e.g. at least 16 MB) to > allow BDI to ramp up it's bdi_thresh but don't extend it upto global dirty > limit? In order to take bdi_thresh as some semi-strict limit, we need to make it very stable at first..otherwise the whole control system may fluctuate violently. Thanks, Fengguang > > + } > > + pos_ratio *= x_intercept - bdi_dirty; > > + do_div(pos_ratio, x_intercept - setpoint + 1); > > + > > + return pos_ratio; > > +} > > + > > Honza > -- > Jan Kara > SUSE Labs, CR