* [PATCH 0/5] IO-less dirty throttling v8
From: Wu Fengguang @ 2011-08-06  8:44 UTC
To: linux-fsdevel
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen,
    Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML, Wu Fengguang

Hi all,

The _core_ bits of the IO-less balance_dirty_pages(). Heavily simplified
and re-commented to make it easier to review.

	git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v8

Only the bare minimum of the algorithms is presented, so you will find
some rough edges in the graphs below. But it's usable :)

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/

And an introduction to the (more complete) algorithms:

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf

Questions and reviews are highly appreciated!

shortlog:

Wu Fengguang (5):
      writeback: account per-bdi accumulated dirtied pages
      writeback: dirty position control
      writeback: dirty rate control
      writeback: per task dirty rate limit
      writeback: IO-less balance_dirty_pages()

The last 4 patches are one single logical change, but split here to make
it easier to review the different parts of the algorithm.

diffstat:

 include/linux/backing-dev.h      |    8 +
 include/linux/sched.h            |    7 +
 include/trace/events/writeback.h |   24 --
 mm/backing-dev.c                 |    3 +
 mm/memory_hotplug.c              |    3 -
 mm/page-writeback.c              |  459 ++++++++++++++++++++++----------------
 6 files changed, 290 insertions(+), 214 deletions(-)

Thanks,
Fengguang
* [PATCH 1/5] writeback: account per-bdi accumulated dirtied pages
From: Wu Fengguang @ 2011-08-06  8:44 UTC
To: linux-fsdevel
Cc: Andrew Morton, Jan Kara, Michael Rubin, Peter Zijlstra, Wu Fengguang,
    Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
    Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-bdi-dirtied.patch --]

Introduce the BDI_DIRTIED counter. It will be used for estimating the
bdi's dirty bandwidth.

CC: Jan Kara <jack@suse.cz>
CC: Michael Rubin <mrubin@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    2 ++
 mm/page-writeback.c         |    1 +
 3 files changed, 4 insertions(+)

--- linux-next.orig/include/linux/backing-dev.h	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-06-12 20:58:40.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_DIRTIED,
 	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
--- linux-next.orig/mm/page-writeback.c	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-12 20:58:40.000000000 +0800
@@ -1530,6 +1530,7 @@ void account_page_dirtied(struct page *p
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
 	}
--- linux-next.orig/mm/backing-dev.c	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-12 20:58:55.000000000 +0800
@@ -97,6 +97,7 @@ static int bdi_debug_stats_show(struct s
 		   "BdiDirtyThresh:     %10lu kB\n"
 		   "DirtyThresh:        %10lu kB\n"
 		   "BackgroundThresh:   %10lu kB\n"
+		   "BdiDirtied:         %10lu kB\n"
 		   "BdiWritten:         %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
@@ -109,6 +110,7 @@ static int bdi_debug_stats_show(struct s
 		   K(bdi_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
+		   (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
 		   (unsigned long) K(bdi->write_bandwidth),
 		   nr_dirty,
* [PATCH 2/5] writeback: dirty position control
From: Wu Fengguang @ 2011-08-06  8:44 UTC
To: linux-fsdevel
Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig, Dave Chinner,
    Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]

Old scheme is,

  | free run area                | throttle area
  ------------------------------+------------------------------------>
                          thresh^                          dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *       [smooth throttled]
  |                  *
  |                    *
  |                      *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .        *
  |                               .           *
  |                               .              *
  |                               .                 *
  +-------------------------------.-----------------------*---------->
                          setpoint^                  limit^ dirty pages

For simplicity, only the global/bdi setpoint control lines are
implemented here, so the [*] curve is straighter than the ideal one
shown in the figure above.

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |  143 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-06 10:31:32.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 11:17:07.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)

+#define BANDWIDTH_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -495,6 +497,147 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }

+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * When the number of dirty pages goes higher/lower than the setpoint, the
+ * dirty position ratio (and hence dirty rate limit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *                               setpoint
+ *                                   v
+ * |-------------------------------*-------------------------------|-----------|
+ * ^                               ^                               ^           ^
+ * (thresh + background_thresh)/2  thresh - thresh/DIRTY_SCOPE  thresh     limit
+ *
+ *                            bdi setpoint
+ *                                   v
+ * |-------------------------------*-------------------------------------------|
+ * ^                               ^                                           ^
+ * 0            bdi_thresh - bdi_thresh/DIRTY_SCOPE                        limit
+ *
+ * (o) pseudo code
+ *
+ *   pos_ratio = 1 << BANDWIDTH_CALC_SHIFT
+ *
+ *   if (dirty < thresh) scale up   pos_ratio
+ *   if (dirty > thresh) scale down pos_ratio
+ *
+ *   if (bdi_dirty < bdi_thresh) scale up   pos_ratio
+ *   if (bdi_dirty > bdi_thresh) scale down pos_ratio
+ *
+ * (o) global/bdi control lines
+ *
+ * Based on the number of dirty pages (the X), pos_ratio (the Y) is scaled by
+ * several control lines in turn.
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * If any control line drops below Y=0 before reaching @limit, an auxiliary
+ * line will be set up to connect them. The figure below illustrates the main
+ * bdi control line with an auxiliary line extending it to @limit.
+ *
+ * This allows smoothly throttling bdi_dirty down to normal if it starts high
+ * in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to 5 times higher than bdi setpoint.
+ * - the bdi dirty thresh goes down quickly due to change of JBOD workload
+ *
+ * o
+ *   o
+ *     o                                      [o] main control line
+ *       o                                    [*] auxiliary control line
+ *         o
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o--------------------- balance point, bw scale = 1
+ *                     | o
+ *                     |   o
+ *                     |     o
+ *                     |       o
+ *                     |         o
+ *                     |           o
+ *                     |             o------- connect point, bw scale = 1/2
+ *                     |               .*
+ *                     |               .  *
+ *                     |               .    *
+ *                     |               .      *
+ *                     |               .        *
+ *                     |               .          *
+ *                     |               .              *
+ * [--------------------+-----------------------------.--------------------*]
+ * 0                 bdi setpoint                  bdi origin            limit
+ *
+ * The bdi control line: if (origin < limit), an auxiliary control line (*)
+ * will be set up to extend the main control line (o) to @limit.
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long origin;
+	unsigned long goal;
+	unsigned long long span;
+	unsigned long long pos_ratio;  /* for scaling up/down the rate limit */
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 */
+	goal = thresh - thresh / DIRTY_SCOPE;
+	origin = 4 * thresh;
+
+	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
+		origin = limit;			/* auxiliary control line */
+		goal = (goal + origin) / 2;
+		pos_ratio >>= 1;
+	}
+	pos_ratio = origin - dirty;
+	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
+	do_div(pos_ratio, origin - goal + 1);
+
+	/*
+	 * bdi setpoint
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
+	/*
+	 * Use span=(4*bw) in single disk case and transit to bdi_thresh in
+	 * JBOD case. For JBOD, bdi_thresh could fluctuate up to its own size.
+	 * Otherwise the bdi write bandwidth is good for limiting the floating
+	 * area, which makes the bdi control line a good backup when the global
+	 * control line is too flat/weak in large memory systems.
+	 */
+	span = (u64) bdi_thresh * (thresh - bdi_thresh) +
+		(4 * bdi->avg_write_bandwidth) * bdi_thresh;
+	do_div(span, thresh + 1);
+	origin = goal + 2 * span;
+
+	if (unlikely(bdi_dirty > goal + span)) {
+		if (bdi_dirty > limit)
+			return 0;
+		if (origin < limit) {
+			origin = limit;		/* auxiliary control line */
+			goal += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= origin - bdi_dirty;
+	do_div(pos_ratio, origin - goal + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 					unsigned long elapsed,
 					unsigned long written)
* Re: [PATCH 2/5] writeback: dirty position control
From: Peter Zijlstra @ 2011-08-08 13:46 UTC
To: Wu Fengguang
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
    Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long origin;
> +	unsigned long goal;
> +	unsigned long long span;
> +	unsigned long long pos_ratio;	/* for scaling up/down the rate limit */
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 */
> +	goal = thresh - thresh / DIRTY_SCOPE;
> +	origin = 4 * thresh;
> +
> +	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> +		origin = limit;			/* auxiliary control line */
> +		goal = (goal + origin) / 2;
> +		pos_ratio >>= 1;

use before init?
* Re: [PATCH 2/5] writeback: dirty position control
From: Wu Fengguang @ 2011-08-08 14:11 UTC
To: Peter Zijlstra
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
    Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 09:46:33PM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +					unsigned long thresh,
> > +					unsigned long dirty,
> > +					unsigned long bdi_thresh,
> > +					unsigned long bdi_dirty)
> > +{
> > +	unsigned long limit = hard_dirty_limit(thresh);
> > +	unsigned long origin;
> > +	unsigned long goal;
> > +	unsigned long long span;
> > +	unsigned long long pos_ratio;	/* for scaling up/down the rate limit */
> > +
> > +	if (unlikely(dirty >= limit))
> > +		return 0;
> > +
> > +	/*
> > +	 * global setpoint
> > +	 */
> > +	goal = thresh - thresh / DIRTY_SCOPE;
> > +	origin = 4 * thresh;
> > +
> > +	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> > +		origin = limit;			/* auxiliary control line */
> > +		goal = (goal + origin) / 2;
> > +		pos_ratio >>= 1;
>
> use before init?

Yeah, it's embarrassing: this bug has been there all the way back to the
initial version... It's actually dead code because (origin < limit)
should never happen.

I feel so good being able to drop 5 more lines of code :)

Thanks,
Fengguang
---
--- linux-next.orig/mm/page-writeback.c	2011-08-08 21:56:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-08 22:04:48.000000000 +0800
@@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
 	goal = thresh - thresh / DIRTY_SCOPE;
 	origin = 4 * thresh;
 
-	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
-		origin = limit;			/* auxiliary control line */
-		goal = (goal + origin) / 2;
-		pos_ratio >>= 1;
-	}
 	pos_ratio = origin - dirty;
 	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
 	do_div(pos_ratio, origin - goal + 1);
* Re: [PATCH 2/5] writeback: dirty position control
From: Peter Zijlstra @ 2011-08-08 14:31 UTC
To: Wu Fengguang
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
    Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> It's actually dead code because (origin < limit) should never happen.
> I feel so good being able to drop 5 more lines of code :)

OK, but that leaves me trying to figure out what origin is, and why it's
4 * thresh.

I'm having a horrible time understanding this stuff.
* Re: [PATCH 2/5] writeback: dirty position control
From: Wu Fengguang @ 2011-08-08 22:47 UTC
To: Peter Zijlstra
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
    Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 10:31:49PM +0800, Peter Zijlstra wrote:
> On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> > It's actually dead code because (origin < limit) should never happen.
> > I feel so good being able to drop 5 more lines of code :)
>
> OK, but that leaves me trying to figure out what origin is, and why its
> 4 * thresh.

origin is where the control line crosses the X axis (in both the
global/bdi setpoint cases).

"4 * thresh" is merely something larger than max(dirty, thresh) that
yields a reasonably gentle slope. The more slope, the larger the
"gravity" to bring the dirty pages back to the setpoint.

> I'm having a horrible time understanding this stuff.

Sorry for that. Do you have more questions?

Thanks,
Fengguang
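Fengguang's slope argument can be checked with a small standalone model. This is not the kernel code: the function name, `CALC_SHIFT` (a stand-in for `BANDWIDTH_CALC_SHIFT`), and the sample numbers below are invented for illustration. The point it demonstrates: the farther out the x-axis crossing (origin), the gentler pos_ratio declines past the setpoint.

```c
#include <assert.h>

/* Toy model of the control line discussed above: pos_ratio falls
 * linearly with dirty and hits 0 at "origin" (the x-axis crossing).
 * Scaled by 2^CALC_SHIFT to mimic the kernel's fixed-point math,
 * so a return value of ~1024 means pos_ratio ~= 1.0. */
#define CALC_SHIFT 10

static unsigned long pos_ratio(unsigned long origin, unsigned long goal,
                               unsigned long dirty)
{
	if (dirty >= origin)
		return 0;
	return ((origin - dirty) << CALC_SHIFT) / (origin - goal + 1);
}
```

With hypothetical thresh = 1000 pages and goal = 875, an origin of 4*thresh = 4000 only mildly lowers pos_ratio for a small excursion past the goal, while an origin sitting right at a nearby limit (say 1200) pulls it down much harder for the same excursion: the "gravity" described above.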
* Re: [PATCH 2/5] writeback: dirty position control
From: Peter Zijlstra @ 2011-08-09 9:31 UTC
To: Wu Fengguang
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
    Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 06:47 +0800, Wu Fengguang wrote:
> origin is where the control line crosses the X axis (in both the
> global/bdi setpoint cases).

Ah, that's normally called the zero, root, or x-intercept:

  http://en.wikipedia.org/wiki/X-intercept
* Re: [PATCH 2/5] writeback: dirty position control
From: Wu Fengguang @ 2011-08-10 12:28 UTC
To: Peter Zijlstra
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
    Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

On Tue, Aug 09, 2011 at 05:31:44PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 06:47 +0800, Wu Fengguang wrote:
> > origin is where the control line crosses the X axis (in both the
> > global/bdi setpoint cases).
>
> Ah, that's normally called zero, root or or x-intercept:
>
>   http://en.wikipedia.org/wiki/X-intercept

Yes indeed! I'll change the name to x_intercept.

Thanks,
Fengguang
* Re: [PATCH 2/5] writeback: dirty position control
From: Peter Zijlstra @ 2011-08-08 14:41 UTC
To: Wu Fengguang
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
    Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> @@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
>  	goal = thresh - thresh / DIRTY_SCOPE;
>  	origin = 4 * thresh;
>
> -	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> -		origin = limit;			/* auxiliary control line */
> -		goal = (goal + origin) / 2;
> -		pos_ratio >>= 1;
> -	}
>  	pos_ratio = origin - dirty;
>  	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
>  	do_div(pos_ratio, origin - goal + 1);

So basically, pos_ratio = (4t - d) / (25/8)t, which if I'm not mistaken
comes out at 32/25 - 8d/25t. Which simply doesn't make sense at all.
* Re: [PATCH 2/5] writeback: dirty position control
From: Wu Fengguang @ 2011-08-08 23:05 UTC
To: Peter Zijlstra
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
    Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 10:41:41PM +0800, Peter Zijlstra wrote:
> On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> > @@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
> >  	goal = thresh - thresh / DIRTY_SCOPE;
> >  	origin = 4 * thresh;
> >
> > -	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> > -		origin = limit;			/* auxiliary control line */
> > -		goal = (goal + origin) / 2;
> > -		pos_ratio >>= 1;
> > -	}
> >  	pos_ratio = origin - dirty;
> >  	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> >  	do_div(pos_ratio, origin - goal + 1);

FYI I've updated the fix to the below one, so that @limit will be used
as the origin in the rare case of (4*thresh < dirty).

--- linux-next.orig/mm/page-writeback.c	2011-08-08 21:56:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-09 06:34:25.000000000 +0800
@@ -536,13 +536,8 @@ static unsigned long bdi_position_ratio(
 	 * global setpoint
 	 */
 	goal = thresh - thresh / DIRTY_SCOPE;
-	origin = 4 * thresh;
+	origin = max(4 * thresh, limit);
 
-	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
-		origin = limit;			/* auxiliary control line */
-		goal = (goal + origin) / 2;
-		pos_ratio >>= 1;
-	}
 	pos_ratio = origin - dirty;
 	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
 	do_div(pos_ratio, origin - goal + 1);

> So basically, pos_ratio = (4t - d) / (25/8)t, which if I'm not mistaken
> comes out at 32/25 - 8d/25t. Which simply doesn't make sense at all.

This is the more meaningful view :)

             origin - dirty
pos_ratio = --------------
             origin - goal

which comes from the below [*] control line, so that when (dirty == goal),
pos_ratio == 1.0:

  ^ pos_ratio
  |
  |
  |        *
  |         *
  |          *
  |           *
  |            *
  |             *
  |              *
  |               *
  |                *
  |                 *
  |                  *
 .. pos_ratio = 1.0 ..*
  |                   . *
  |                   .   *
  |                   .     *
  |                   .       *
  |                   .         *
  |                   .           *
  |                   .             *
  |                   .               *
  |                   .                 *
  |                   .                   *
  |                   .                     *
  |                   .                       *
  |                   .                         *
  |                   .                           *
  |                   .                             *
  |                   .                               *
  +-------------------.---------------------------------*---------------->
  0                  goal                             origin   dirty pages

Thanks,
Fengguang
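The updated control line can be exercised outside the kernel with a hedged sketch. Everything below is a hypothetical userspace model, not the patch: a ternary stands in for `max()`, plain division for `do_div()`, `CALC_SHIFT` for `BANDWIDTH_CALC_SHIFT`, and all numbers are invented.

```c
#include <assert.h>

#define CALC_SHIFT 10	/* stand-in for BANDWIDTH_CALC_SHIFT */

/* Sketch of the fixed global control line: with
 * origin = max(4 * thresh, limit) and the early bail-out on
 * dirty >= limit, (origin - dirty) can never underflow, while
 * pos_ratio(goal) ~= 1.0 (i.e. ~1024 here) and pos_ratio -> 0
 * as dirty approaches origin. */
static unsigned long global_pos_ratio(unsigned long thresh,
				      unsigned long limit,
				      unsigned long goal,
				      unsigned long dirty)
{
	unsigned long origin = 4 * thresh > limit ? 4 * thresh : limit;

	if (dirty >= limit)
		return 0;
	return ((origin - dirty) << CALC_SHIFT) / (origin - goal + 1);
}
```

Even in the rare (4*thresh < limit) case, e.g. thresh = 100 with limit = 5000, the origin moves out to @limit and the subtraction stays well-defined all the way up to the hard limit.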
* Re: [PATCH 2/5] writeback: dirty position control
From: Peter Zijlstra @ 2011-08-09 10:32 UTC
To: Wu Fengguang
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
    Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 07:05 +0800, Wu Fengguang wrote:
> This is the more meaningful view :)
>
>              origin - dirty
> pos_ratio = --------------
>              origin - goal
>
> which comes from the below [*] control line, so that when (dirty == goal),
> pos_ratio == 1.0:

OK, so basically you want a linear function for which:

  f(goal) = 1, and which has a root somewhere > goal.

(that one line is much more informative than all your graphs put
together, one can start from there and derive your function)

That does indeed get you the above function, now what does it mean?

> + * When the number of dirty pages go higher/lower than the setpoint, the dirty
> + * position ratio (and hence dirty rate limit) will be decreased/increased to
> + * bring the dirty pages back to the setpoint.

(you seem inconsistent with your terminology, I think goal and setpoint
are interchanged? I looked up set point and it's a term from control
system theory, so I'll chalk that up to my own ignorance..)

Ok, so higher dirty -> lower position ratio -> lower dirty rate (and the
inverse), now what does that do...

/me goes read other patches in search of more clues..

I'm starting to dislike graphs.. why not simply state where those things
come from, that's much easier.
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05             ` Wu Fengguang
@ 2011-08-09 17:20               ` Peter Zijlstra
  -1 siblings, 0 replies; 301+ messages in thread
From: Peter Zijlstra @ 2011-08-09 17:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
      linux-mm, LKML

On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> >                origin - dirty
> >   pos_ratio = --------------
> >                origin - goal
>
> > which comes from the below [*] control line, so that when (dirty == goal),
> > pos_ratio == 1.0:
>
> OK, so basically you want a linear function for which:
>
>   f(goal) = 1 and has a root somewhere > goal.
>
> (that one line is much more informative than all your graphs put
> together, one can start from there and derive your function)
>
> That does indeed get you the above function, now what does it mean?

So going by:

                                         write_bw
  ref_bw = dirty_ratelimit * pos_ratio * --------
                                         dirty_bw

pos_ratio seems to be the feedback on the deviation of the dirty pages
around its setpoint. So we adjust the reference bw (or rather
ratelimit) to take account of the shift in output vs input capacity as
well as the shift in dirty pages around its setpoint.

From that we derive the condition that:

  pos_ratio(setpoint) := 1

Now in order to create a linear function we need one more condition. We
get one from the fact that once we hit the limit we should hard
throttle our writers. We get that by setting the ratelimit to 0,
because, after all, pause = nr_dirtied / ratelimit would yield inf. in
that case. Thus:

  pos_ratio(limit) := 0

Using these two conditions we can solve the equations and get your:

                      limit - dirty
  pos_ratio(dirty) = ----------------
                     limit - setpoint

Now, for some reason you chose not to use limit, but something like
min(limit, 4*thresh), something to do with the slope affecting the
rate of adjustment. This wants a comment someplace.

Now all of the above would seem to suggest:

  dirty_ratelimit := ref_bw

However for that you use:

  if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
          dirty_ratelimit = max(ref_bw, pos_bw);

  if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
          dirty_ratelimit = min(ref_bw, pos_bw);

You have:

  pos_bw = dirty_ratelimit * pos_ratio

Which is ref_bw without the write_bw/dirty_bw factor, and this
confuses me.. why are you ignoring the shift in output vs input rate
there?

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09 17:20               ` Peter Zijlstra
@ 2011-08-10 22:34                 ` Jan Kara
  -1 siblings, 0 replies; 301+ messages in thread
From: Jan Kara @ 2011-08-10 22:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
      Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
      Vivek Goyal, Andrea Righi, linux-mm, LKML

On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > >                origin - dirty
> > >   pos_ratio = --------------
> > >                origin - goal
> >
> > > which comes from the below [*] control line, so that when (dirty == goal),
> > > pos_ratio == 1.0:
> >
> > OK, so basically you want a linear function for which:
> >
> >   f(goal) = 1 and has a root somewhere > goal.
> >
> > (that one line is much more informative than all your graphs put
> > together, one can start from there and derive your function)
> >
> > That does indeed get you the above function, now what does it mean?
>
> So going by:
>
>                                          write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * --------
>                                          dirty_bw

  Actually, thinking about these formulas, why do we even bother with
computing all these factors like write_bw, dirty_bw, pos_ratio, ...
Couldn't we just have a feedback loop (probably similar to the one
computing pos_ratio) which maintains a single value - ratelimit? When
we are getting close to the dirty limit, we scale the ratelimit down;
when we are getting significantly below the dirty limit, we scale the
ratelimit up. Because looking at the formulas it seems to me that the
net effect is the same - pos_ratio basically overrules everything...

> pos_ratio seems to be the feedback on the deviation of the dirty pages
> around its setpoint. So we adjust the reference bw (or rather
> ratelimit) to take account of the shift in output vs input capacity as
> well as the shift in dirty pages around its setpoint.
>
> From that we derive the condition that:
>
>   pos_ratio(setpoint) := 1
>
> Now in order to create a linear function we need one more condition. We
> get one from the fact that once we hit the limit we should hard
> throttle our writers. We get that by setting the ratelimit to 0,
> because, after all, pause = nr_dirtied / ratelimit would yield inf. in
> that case. Thus:
>
>   pos_ratio(limit) := 0
>
> Using these two conditions we can solve the equations and get your:
>
>                       limit - dirty
>   pos_ratio(dirty) = ----------------
>                      limit - setpoint
>
> Now, for some reason you chose not to use limit, but something like
> min(limit, 4*thresh), something to do with the slope affecting the rate
> of adjustment. This wants a comment someplace.
>
> Now all of the above would seem to suggest:
>
>   dirty_ratelimit := ref_bw
>
> However for that you use:
>
>   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
>           dirty_ratelimit = max(ref_bw, pos_bw);
>
>   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
>           dirty_ratelimit = min(ref_bw, pos_bw);
>
> You have:
>
>   pos_bw = dirty_ratelimit * pos_ratio
>
> Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> why are you ignoring the shift in output vs input rate there?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-10 22:34                 ` Jan Kara
@ 2011-08-11  2:29                   ` Wu Fengguang
  -1 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-11 2:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
      linux-mm, LKML

On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > >                origin - dirty
> > > >   pos_ratio = --------------
> > > >                origin - goal
> > >
> > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > pos_ratio == 1.0:
> > >
> > > OK, so basically you want a linear function for which:
> > >
> > >   f(goal) = 1 and has a root somewhere > goal.
> > >
> > > (that one line is much more informative than all your graphs put
> > > together, one can start from there and derive your function)
> > >
> > > That does indeed get you the above function, now what does it mean?
> >
> > So going by:
> >
> >                                          write_bw
> >   ref_bw = dirty_ratelimit * pos_ratio * --------
> >                                          dirty_bw
>
>   Actually, thinking about these formulas, why do we even bother with
> computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> Couldn't we just have a feedback loop (probably similar to the one
> computing pos_ratio) which will maintain single value - ratelimit? When
> we are getting close to dirty limit, we will scale ratelimit down, when
> we will be getting significantly below dirty limit, we will scale the
> ratelimit up. Because looking at the formulas it seems to me that the
> net effect is the same - pos_ratio basically overrules everything...

Good question. That is actually one of the early approaches I tried.
It somehow worked, however the resulting ratelimit was not only slow
to respond, but also oscillating all the time.

This is due to these imperfections:

1) pos_ratio at best only provides a "direction" for adjusting the
   ratelimit. There are only vague clues that if pos_ratio is small,
   the errors in the ratelimit should be small.

2) Due to time lag, the assumptions in (1) about "direction" and
   "error size" can be wrong. The ratelimit may already be
   over-adjusted by the time the dirty pages approach the setpoint.
   The larger the memory, the longer the time lag, and the easier it
   is to overshoot and oscillate.

3) Dirty pages are constantly fluctuating around the setpoint, and so
   is pos_ratio.

With (1) and (2), it's a control system very susceptible to
disturbances. With (3) we get constant disturbances. Well, I had a
very hard time and played dirty tricks (which you may never want to
know ;-) trying to trade off between response time and stability..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-11  2:29                   ` Wu Fengguang
@ 2011-08-11 11:14                     ` Jan Kara
  -1 siblings, 0 replies; 301+ messages in thread
From: Jan Kara @ 2011-08-11 11:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Peter Zijlstra, linux-fsdevel, Andrew Morton,
      Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
      Vivek Goyal, Andrea Righi, linux-mm, LKML

On Thu 11-08-11 10:29:52, Wu Fengguang wrote:
> On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> > On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > > >                origin - dirty
> > > > >   pos_ratio = --------------
> > > > >                origin - goal
> > > >
> > > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > > pos_ratio == 1.0:
> > > >
> > > > OK, so basically you want a linear function for which:
> > > >
> > > >   f(goal) = 1 and has a root somewhere > goal.
> > > >
> > > > (that one line is much more informative than all your graphs put
> > > > together, one can start from there and derive your function)
> > > >
> > > > That does indeed get you the above function, now what does it mean?
> > >
> > > So going by:
> > >
> > >                                          write_bw
> > >   ref_bw = dirty_ratelimit * pos_ratio * --------
> > >                                          dirty_bw
> >
> >   Actually, thinking about these formulas, why do we even bother with
> > computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> > Couldn't we just have a feedback loop (probably similar to the one
> > computing pos_ratio) which will maintain single value - ratelimit?
> > When we are getting close to dirty limit, we will scale ratelimit
> > down, when we will be getting significantly below dirty limit, we
> > will scale the ratelimit up. Because looking at the formulas it seems
> > to me that the net effect is the same - pos_ratio basically overrules
> > everything...
>
> Good question. That is actually one of the early approaches I tried.
> It somehow worked, however the resulted ratelimit is not only slow
> responding, but also oscillating all the time.

  Yes, I think I vaguely remember that.

> This is due to the imperfections
>
> 1) pos_ratio at best only provides a "direction" for adjusting the
>    ratelimit. There is only vague clues that if pos_ratio is small,
>    the errors in ratelimit should be small.
>
> 2) Due to time-lag, the assumptions in (1) about "direction" and
>    "error size" can be wrong. The ratelimit may already be
>    over-adjusted when the dirty pages take time to approach the
>    setpoint. The larger memory, the more time lag, the easier to
>    overshoot and oscillate.
>
> 3) dirty pages are constantly fluctuating around the setpoint,
>    so is pos_ratio.
>
> With (1) and (2), it's a control system very susceptible to disturbs.
> With (3) we get constant disturbs. Well I had very hard time and
> played dirty tricks (which you may never want to know ;-) trying to
> tradeoff between response time and stableness..

  Yes, I can see especially 2) is a problem. But I don't understand why
your current formula would be that much different. As Peter decoded
from your code, your current formula is:

                                         write_bw
  ref_bw = dirty_ratelimit * pos_ratio * --------
                                         dirty_bw

while previously it was essentially:

  ref_bw = dirty_ratelimit * pos_ratio

So what is so magical about computing write_bw and dirty_bw separately?
Is it because previously you did not use the derivative of the distance
from the goal for updating pos_ratio? Because in your current formula
write_bw/dirty_bw is a derivative of position...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-11 11:14                     ` Jan Kara
@ 2011-08-16  8:35                       ` Wu Fengguang
  -1 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-16 8:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
      linux-mm, LKML

On Thu, Aug 11, 2011 at 07:14:23PM +0800, Jan Kara wrote:
> On Thu 11-08-11 10:29:52, Wu Fengguang wrote:
> > On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> > > On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > > > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > > > >                origin - dirty
> > > > > >   pos_ratio = --------------
> > > > > >                origin - goal
> > > > >
> > > > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > > > pos_ratio == 1.0:
> > > > >
> > > > > OK, so basically you want a linear function for which:
> > > > >
> > > > >   f(goal) = 1 and has a root somewhere > goal.
> > > > >
> > > > > (that one line is much more informative than all your graphs put
> > > > > together, one can start from there and derive your function)
> > > > >
> > > > > That does indeed get you the above function, now what does it mean?
> > > >
> > > > So going by:
> > > >
> > > >                                          write_bw
> > > >   ref_bw = dirty_ratelimit * pos_ratio * --------
> > > >                                          dirty_bw
> > >
> > >   Actually, thinking about these formulas, why do we even bother with
> > > computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> > > Couldn't we just have a feedback loop (probably similar to the one
> > > computing pos_ratio) which will maintain single value - ratelimit?
> > > When we are getting close to dirty limit, we will scale ratelimit
> > > down, when we will be getting significantly below dirty limit, we
> > > will scale the ratelimit up. Because looking at the formulas it
> > > seems to me that the net effect is the same - pos_ratio basically
> > > overrules everything...
> >
> > Good question. That is actually one of the early approaches I tried.
> > It somehow worked, however the resulted ratelimit is not only slow
> > responding, but also oscillating all the time.
>
>   Yes, I think I vaguely remember that.
>
> > This is due to the imperfections
> >
> > 1) pos_ratio at best only provides a "direction" for adjusting the
> >    ratelimit. There is only vague clues that if pos_ratio is small,
> >    the errors in ratelimit should be small.
> >
> > 2) Due to time-lag, the assumptions in (1) about "direction" and
> >    "error size" can be wrong. The ratelimit may already be
> >    over-adjusted when the dirty pages take time to approach the
> >    setpoint. The larger memory, the more time lag, the easier to
> >    overshoot and oscillate.
> >
> > 3) dirty pages are constantly fluctuating around the setpoint,
> >    so is pos_ratio.
> >
> > With (1) and (2), it's a control system very susceptible to disturbs.
> > With (3) we get constant disturbs. Well I had very hard time and
> > played dirty tricks (which you may never want to know ;-) trying to
> > tradeoff between response time and stableness..
>
>   Yes, I can see especially 2) is a problem. But I don't understand why
> your current formula would be that much different. As Peter decoded from
> your code, your current formula is:
>
>                                          write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * --------
>                                          dirty_bw
>
> while previously it was essentially:
>
>   ref_bw = dirty_ratelimit * pos_ratio

Sorry, what's the code you are referring to? Does the changelog in the
newly posted patchset make the ref_bw calculation and dirty_ratelimit
updating clearer?

> So what is so magical about computing write_bw and dirty_bw separately?
> Is it because previously you did not use derivation of distance from the
> goal for updating pos_ratio? Because in your current formula
> write_bw/dirty_bw is a derivation of position...

dirty_bw is the main feedback. If we are throttling too much, the
resulting dirty_bw will be lower than write_bw. Thus

                                      write_bw
  ref_bw = ratelimit_in_past_200ms * --------
                                      dirty_bw

will give us a higher ref_bw than ratelimit_in_past_200ms. For a pure
dd workload, the ref_bw computed by the above formula is exactly the
balanced rate (if not considering trivial errors).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09 17:20     ` Peter Zijlstra
@ 2011-08-12 13:19       ` Wu Fengguang
  0 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-12 13:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
      linux-mm, LKML

On Wed, Aug 10, 2011 at 01:20:27AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > >               origin - dirty
> > > pos_ratio = --------------
> > >               origin - goal
> > >
> > > which comes from the below [*] control line, so that when (dirty == goal),
> > > pos_ratio == 1.0:
> >
> > OK, so basically you want a linear function for which:
> >
> >   f(goal) = 1 and has a root somewhere > goal.
> >
> > (that one line is much more informative than all your graphs put
> > together, one can start from there and derive your function)
> >
> > That does indeed get you the above function, now what does it mean?
>
> So going by:
>
>                                           write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * ----------
>                                           dirty_bw
>
> pos_ratio seems to be the feedback on the deviation of the dirty pages
> around its setpoint.

Yes.

> So we adjust the reference bw (or rather ratelimit)
> to take account of the shift in output vs input capacity as well as the
> shift in dirty pages around its setpoint.

However, the above function is better interpreted as

                                           write_bw
  ref_bw = task_ratelimit_in_past_200ms * ----------
                                           dirty_bw

where

  task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio

It would be highly confusing to look for a direct "logical" relationship
between ref_bw and pos_ratio in the above equation.

> From that we derive the condition that:
>
>   pos_ratio(setpoint) := 1

Right.

> Now in order to create a linear function we need one more condition. We
> get one from the fact that once we hit the limit we should hard throttle
> our writers. We get that by setting the ratelimit to 0, because, after
> all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
>
>   pos_ratio(limit) := 0
>
> Using these two conditions we can solve the equations and get your:
>
>                        limit - dirty
>   pos_ratio(dirty) = ----------------
>                      limit - setpoint
>
> Now, for some reason you chose not to use limit, but something like
> min(limit, 4*thresh) something to do with the slope affecting the rate
> of adjustment. This wants a comment someplace.

Thanks to your reasoning, which led to the more elegant

                             setpoint - dirty  3
  pos_ratio(dirty) := 1 + ( ---------------- )
                             limit - setpoint

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05   ` Wu Fengguang
@ 2011-08-10 21:40     ` Vivek Goyal
  0 siblings, 0 replies; 301+ messages in thread
From: Vivek Goyal @ 2011-08-10 21:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
      Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
      Andrea Righi, linux-mm, LKML

On Tue, Aug 09, 2011 at 07:05:35AM +0800, Wu Fengguang wrote:
> On Mon, Aug 08, 2011 at 10:41:41PM +0800, Peter Zijlstra wrote:
> > On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> > > @@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
> > >  	goal = thresh - thresh / DIRTY_SCOPE;
> > >  	origin = 4 * thresh;
> > >
> > > -	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> > > -		origin = limit;	/* auxiliary control line */
> > > -		goal = (goal + origin) / 2;
> > > -		pos_ratio >>= 1;
> > > -	}
> > >  	pos_ratio = origin - dirty;
> > >  	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> > >  	do_div(pos_ratio, origin - goal + 1);
>
> FYI I've updated the fix to the below one, so that @limit will be used
> as the origin in the rare case of (4*thresh < dirty).
>
> --- linux-next.orig/mm/page-writeback.c	2011-08-08 21:56:11.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-09 06:34:25.000000000 +0800
> @@ -536,13 +536,8 @@ static unsigned long bdi_position_ratio(
>  	 * global setpoint
>  	 */
>  	goal = thresh - thresh / DIRTY_SCOPE;
> -	origin = 4 * thresh;
> +	origin = max(4 * thresh, limit);

Hi Fengguang,

Ok, so just trying to understand this pos_ratio a little better. You
have the following basic formula:

              origin - dirty
  pos_ratio = --------------
              origin - goal

The terminology is very confusing, and the following is my understanding.

- setpoint == goal

  The setpoint is the point where we would like our number of dirty
  pages to be, and at this point pos_ratio = 1. For global dirty this
  number seems to be (thresh - thresh / DIRTY_SCOPE).

- thresh

  The dirty page threshold calculated from dirty_ratio (a certain
  percentage of total memory).

- origin (seems to be equivalent of limit)

  This seems to be the reference point/limit we don't want to cross, and
  the distance from this limit basically decides the pos_ratio. The
  closer we are to the limit, the lower the pos_ratio; the further away
  we are, the higher the pos_ratio.

So the threshold is just a number which helps us determine goal and limit:

  goal  = thresh - thresh / DIRTY_SCOPE
  limit = 4 * thresh

So the goal is where we want to be, and we throttle the task more as we
move away from the goal and approach the limit. We keep the limit high
enough so that (origin - dirty) does not become negative.

So we do expect to cross "thresh", otherwise thresh itself could have
served as the limit?

If my understanding is right, can we get rid of the terms "setpoint" and
"origin"? Would it be easier to understand things if we just talked in
terms of "goal" and "limit" and how these are derived from "thresh"?

  thresh == soft limit
  limit  == 4 * thresh (hard limit)
  goal   =  thresh - thresh / DIRTY_SCOPE (where we want the system to
            be in steady state)

              limit - dirty
  pos_ratio = --------------
              limit - goal

Thanks
Vivek

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-10 21:40       ` Vivek Goyal
@ 2011-08-16  8:55         ` Wu Fengguang
  0 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-16 8:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
      Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
      Andrea Righi, linux-mm, LKML

Hi Vivek,

Sorry for the confusion. I hope Peter's 3rd order polynomial abstraction
in v9 clarifies the concepts a lot.

As for the old global control line

              origin - dirty
  pos_ratio = --------------                            (1)
              origin - goal

the choice of

  origin = 4 * thresh                                   (2)

effectively decides the slope of the line. The use of @limit in the code

  origin = max(4 * thresh, limit)                       (3)

is merely to safeguard against the rare case that (2) might result in a
negative pos_ratio in (1).

I have another patch to add a "brake" area immediately below @limit that
will scale pos_ratio down to 0. However, that's no longer necessary with
the 3rd order polynomial solution.

Note that @limit will normally be equal to @thresh, except in the rare
case that @thresh is suddenly knocked down and @limit is taking time to
follow it.

Thanks,
Fengguang

> Hi Fengguang,
>
> Ok, so just trying to understand this pos_ratio little better.
>
> You have following basic formula.
>
>               origin - dirty
>   pos_ratio = --------------
>               origin - goal
>
> Terminology is very confusing and following is my understanding.
>
> - setpoint == goal
>
>   setpoint is the point where we would like our number of dirty pages to
>   be and at this point pos_ratio = 1. For global dirty this number seems
>   to be (thresh - thresh / DIRTY_SCOPE)
>
> - thresh
>
>   dirty page threshold calculated from dirty_ratio (Certain percentage of
>   total memory).
>
> - Origin (seems to be equivalent of limit)
>
>   This seems to be the reference point/limit we don't want to cross and
>   distance from this limit basically decides the pos_ratio. Closer we
>   are to limit, lower the pos_ratio and further we are higher the
>   pos_ratio.
>
> So threshold is just a number which helps us determine goal and limit.
>
>   goal = thresh - thresh / DIRTY_SCOPE
>   limit = 4*thresh
>
> So goal is where we want to be and we start throttling the task more as
> we move away goal and approach limit. We keep the limit high enough
> so that (origin-dirty) does not become negative entity.
>
> So we do expect to cross "thresh" otherwise thresh itself could have
> served as limit?
>
> If my understanding is right, then can we get rid of terms "setpoint" and
> "origin". Would it be easier to understand the things if we just talk
> in terms of "goal" and "limit" and how these are derived from "thresh".
>
>   thresh == soft limit
>   limit == 4*thresh (hard limit)
>   goal = thresh - thresh / DIRTY_SCOPE (where we want system to
>          be in steady state).
>
>               limit - dirty
>   pos_ratio = --------------
>               limit - goal
>
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05     ` Wu Fengguang
@ 2011-08-11 22:56       ` Peter Zijlstra
  0 siblings, 0 replies; 301+ messages in thread
From: Peter Zijlstra @ 2011-08-11 22:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
      linux-mm, LKML

On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> So going by:
>
>                                           write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * ----------
>                                           dirty_bw
>
> pos_ratio seems to be the feedback on the deviation of the dirty pages
> around its setpoint. So we adjust the reference bw (or rather ratelimit)
> to take account of the shift in output vs input capacity as well as the
> shift in dirty pages around its setpoint.
>
> From that we derive the condition that:
>
>   pos_ratio(setpoint) := 1
>
> Now in order to create a linear function we need one more condition. We
> get one from the fact that once we hit the limit we should hard throttle
> our writers. We get that by setting the ratelimit to 0, because, after
> all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
>
>   pos_ratio(limit) := 0
>
> Using these two conditions we can solve the equations and get your:
>
>                        limit - dirty
>   pos_ratio(dirty) = ----------------
>                      limit - setpoint
>
> Now, for some reason you chose not to use limit, but something like
> min(limit, 4*thresh) something to do with the slope affecting the rate
> of adjustment. This wants a comment someplace.

Ok, so I think that pos_ratio(limit) := 0 is a stronger condition than
your negative slope (df/dx < 0), simply because it implies your
condition and because it expresses our hard stop at limit.

Also, while I know this is totally over the top, but.. I saw you added a
ramp and brake area in future patches, so have you considered using a
third order polynomial instead?

The simple:

  f(x) = -x^3

has the 'right' shape; all we need is to move it so that:

  f(s) = 1

and stretch it to put the single root at our limit. You'd get something
like:

               s - x  3
  f(x) := 1 + (-----)
                 d

which, as required, is 1 at our setpoint, and the factor d stretches the
middle bit. It has a single (real) root at:

  x = s + d,

and by setting that to our limit, we get:

  d = l - s

making our final function look like:

               s - x  3
  f(x) := 1 + (-----)
               l - s

You can clamp it at [0,2] or so. The implementation wouldn't be too
horrid either, something like:

unsigned long bdi_pos_ratio(..)
{
	if (dirty > limit)
		return 0;

	if (dirty < 2*setpoint - limit)
		return 2 * SCALE;

	x = SCALE * (setpoint - dirty) / (limit - setpoint);
	xx = (x * x) / SCALE;
	xxx = (xx * x) / SCALE;

	return SCALE + xxx;	/* i.e. f = 1 + x^3, in SCALE units */
}

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-11 22:56 ` Peter Zijlstra @ 2011-08-12 2:43 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-12 2:43 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Fri, Aug 12, 2011 at 06:56:06AM +0800, Peter Zijlstra wrote: > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote: > > So going by: > > > > write_bw > > ref_bw = dirty_ratelimit * pos_ratio * -------- > > dirty_bw > > > > pos_ratio seems to be the feedback on the deviation of the dirty pages > > around its setpoint. So we adjust the reference bw (or rather ratelimit) > > to take account of the shift in output vs input capacity as well as the > > shift in dirty pages around its setpoint. > > > > From that we derive the condition that: > > > > pos_ratio(setpoint) := 1 > > > > Now in order to create a linear function we need one more condition. We > > get one from the fact that once we hit the limit we should hard throttle > > our writers. We get that by setting the ratelimit to 0, because, after > > all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus: > > > > pos_ratio(limit) := 0 > > > > Using these two conditions we can solve the equations and get your: > > > > limit - dirty > > pos_ratio(dirty) = ---------------- > > limit - setpoint > > > > Now, for some reason you chose not to use limit, but something like > > min(limit, 4*thresh) something to do with the slope affecting the rate > > of adjustment. This wants a comment someplace. > > Ok, so I think that pos_ratio(limit) := 0, is a stronger condition than > your negative slope (df/dx < 0), simply because it implies your > condition and because it expresses our hard stop at limit. Right. That's a good point. > Also, while I know this is totally over the top, but..
> > I saw you added a ramp and brake area in future patches, so have you > considered using a third order polynomial instead? No I have not ;) The 3 lines/curves should be a bit more flexible/configurable than the single 3rd order polynomial. However, the 3rd order polynomial is surely much simpler and more consistent, since it removes the explicit rampup/brake areas and curves. > The simple: > > f(x) = -x^3 > > has the 'right' shape, all we need is move it so that: > > f(s) = 1 > > and stretch it to put the single root at our limit. You'd get something > like: > > s - x 3 > f(x) := 1 + (-----) > d > > Which, as required, is 1 at our setpoint and the factor d stretches the > middle bit. Which has a single (real) root at: > > x = s + d, > > by setting that to our limit, we get: > > d = l - s > > Making our final function look like: > > s - x 3 > f(x) := 1 + (-----) > l - s Very intuitive reasoning, thanks! I substituted real numbers into the function assuming a mem=2GB system. with limit=thresh: gnuplot> set xrange [60000:80000] gnuplot> plot 1 + (70000.0 - x)**3/(80000-70000.0)**3 with limit=thresh+thresh/DIRTY_SCOPE: gnuplot> set xrange [60000:90000] gnuplot> plot 1 + (70000.0 - x)**3/(90000-70000.0)**3 Figures attached. The latter produces a reasonably flat slope and I'll give it a spin in the dd tests :) > You can clamp it at [0,2] or so. Looking at the figures, we may even do without the clamp because it's already inside the range [0, 2]. > The implementation wouldn't be too horrid either, something like: > > unsigned long bdi_pos_ratio(..) > { > if (dirty > limit) > return 0; > > if (dirty < 2*setpoint - limit) > return 2 * SCALE; > > x = SCALE * (setpoint - dirty) / (limit - setpoint); > xx = (x * x) / SCALE; > xxx = (xx * x) / SCALE; > > return xxx; > } Looks very neat, much simpler than the three curves solution! Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-12 2:43 ` Wu Fengguang (?) @ 2011-08-12 3:18 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-12 3:18 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML [-- Attachment #1: Type: text/plain, Size: 1306 bytes --] Sorry forgot the 2 gnuplot figures, attached now. > > Making our final function look like: > > > > s - x 3 > > f(x) := 1 + (-----) > > l - s > > Very intuitive reasoning, thanks! > > I substituted real numbers to the function assuming a mem=2GB system. > > with limit=thresh: > > gnuplot> set xrange [60000:80000] > gnuplot> plot 1 + (70000.0 - x)**3/(80000-70000.0)**3 > > with limit=thresh+thresh/DIRTY_SCOPE > > gnuplot> set xrange [60000:90000] > gnuplot> plot 1 + (70000.0 - x)**3/(90000-70000.0)**3 > > Figures attached. The latter produces reasonably flat slope and I'll > give it a spin in the dd tests :) > > > You can clamp it at [0,2] or so. > > Looking at the figures, we may even do without the clamp because it's > already inside the range [0, 2]. > > > The implementation wouldn't be too horrid either, something like: > > > > unsigned long bdi_pos_ratio(..) > > { > > if (dirty > limit) > > return 0; > > > > if (dirty < 2*setpoint - limit) > > return 2 * SCALE; > > > > x = SCALE * (setpoint - dirty) / (limit - setpoint); > > xx = (x * x) / SCALE; > > xxx = (xx * x) / SCALE; > > > > return xxx; > > } > > Looks very neat, much simpler than the three curves solution! > > Thanks, > Fengguang [-- Attachment #2: 3rd-order-limit=thresh+halfscope.png --] [-- Type: image/png, Size: 30247 bytes --] [-- Attachment #3: 3rd-order-limit=thresh.png --] [-- Type: image/png, Size: 28785 bytes --] ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-12 2:43 ` Wu Fengguang @ 2011-08-12 5:45 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-12 5:45 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML > > Making our final function look like: > > > > s - x 3 > > f(x) := 1 + (-----) > > l - s > > Very intuitive reasoning, thanks! > > I substituted real numbers to the function assuming a mem=2GB system. > > with limit=thresh: > > gnuplot> set xrange [60000:80000] > gnuplot> plot 1 + (70000.0 - x)**3/(80000-70000.0)**3 I'll use the above one, which is simpler and more elegant: f(freerun) = 2.0 f(setpoint) = 1.0 f(limit) = 0 The code is: unsigned long freerun = (thresh + bg_thresh) / 2; setpoint = (limit + freerun) / 2; pos_ratio = abs(dirty - setpoint); pos_ratio <<= BANDWIDTH_CALC_SHIFT; do_div(pos_ratio, limit - setpoint + 1); x = pos_ratio; pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT; pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT; if (dirty > setpoint) pos_ratio = -pos_ratio; pos_ratio += 1 << BANDWIDTH_CALC_SHIFT; Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-12 5:45 ` Wu Fengguang (?) @ 2011-08-12 9:45 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-12 9:45 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Fri, 2011-08-12 at 13:45 +0800, Wu Fengguang wrote: > Code is > > unsigned long freerun = (thresh + bg_thresh) / 2; > > setpoint = (limit + freerun) / 2; > pos_ratio = abs(dirty - setpoint); > pos_ratio <<= BANDWIDTH_CALC_SHIFT; > do_div(pos_ratio, limit - setpoint + 1); Why do you use do_div()? From the code, those things are unsigned long, and you can divide that just fine. Also, there's div64_s64 that can do signed divides for s64 types. That'll lose the extra conditionals you used for abs and putting the sign back. > x = pos_ratio; > pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT; > pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT; So on 32-bit with unsigned long, that gives 32 = 2*(10+b) bits for x, which solves to b = 6; that isn't going to be enough, I figure, since (dirty-setpoint) !< 64. So you really need to use u64/s64 types here, unsigned long just won't do; with u64 you have 64 = 2*(10+b), i.e. b = 22 bits for x, which should fit. > if (dirty > setpoint) > pos_ratio = -pos_ratio; > pos_ratio += 1 << BANDWIDTH_CALC_SHIFT; ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-12 9:45 ` Peter Zijlstra @ 2011-08-12 11:07 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-12 11:07 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Fri, Aug 12, 2011 at 05:45:33PM +0800, Peter Zijlstra wrote: > On Fri, 2011-08-12 at 13:45 +0800, Wu Fengguang wrote: > > Code is > > > > unsigned long freerun = (thresh + bg_thresh) / 2; > > > > setpoint = (limit + freerun) / 2; > > pos_ratio = abs(dirty - setpoint); > > pos_ratio <<= BANDWIDTH_CALC_SHIFT; > > do_div(pos_ratio, limit - setpoint + 1); > > Why do you use do_div()? from the code those things are unsigned long, > and you can divide that just fine. Because pos_ratio was "unsigned long long".. > Also, there's div64_s64 that can do signed divides for s64 types. > That'll loose the extra conditionals you used for abs and putting the > sign back. Ah ok, good to know that :) > > x = pos_ratio; > > pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT; > > pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT; > > So on 32bit with unsigned long that gets 32=2*(10+b) bits for x, that > solves to 6, which isn't going to be enough I figure since > (dirty-setpoint) !< 64. > > So you really need to use u64/s64 types here, unsigned long just won't > do, with u64 you have 64=2(10+b) 22 bits for x, which should fit. 
Sure, here is the updated code: long long pos_ratio; /* for scaling up/down the rate limit */ long x; if (unlikely(dirty >= limit)) return 0; /* * global setpoint * * setpoint - dirty 3 * f(dirty) := 1 + (----------------) * limit - setpoint * * it's a 3rd order polynomial that is subject to * * (1) f(freerun) = 2.0 => ramp up base_rate reasonably fast * (2) f(setpoint) = 1.0 => the balance point * (3) f(limit) = 0 => the hard limit * (4) df/dx < 0 => negative feedback control * (5) the closer to setpoint, the smaller |df/dx| (and the reverse), * => fast response on large errors; small oscillation near setpoint */ setpoint = (limit + freerun) / 2; pos_ratio = (setpoint - dirty) << RATELIMIT_CALC_SHIFT; pos_ratio = div_s64(pos_ratio, limit - setpoint + 1); x = pos_ratio; pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; pos_ratio += 1 << RATELIMIT_CALC_SHIFT; Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-12 11:07 ` Wu Fengguang (?) @ 2011-08-12 12:17 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-12 12:17 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Fri, 2011-08-12 at 19:07 +0800, Wu Fengguang wrote: > Because pos_ratio was "unsigned long long".. Ah! totally missed that ;-) Yes looks good. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-12 2:43 ` Wu Fengguang (?) @ 2011-08-12 9:47 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-12 9:47 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Fri, 2011-08-12 at 10:43 +0800, Wu Fengguang wrote: > > s - x 3 > > f(x) := 1 + (-----) > > l - s > > Looks very neat, much simpler than the three curves solution! Glad you like it; there is of course the small matter of real-world behaviour to consider. Let's hope that works as well :-) ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-12 9:47 ` Peter Zijlstra @ 2011-08-12 11:11 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-12 11:11 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Fri, Aug 12, 2011 at 05:47:54PM +0800, Peter Zijlstra wrote: > On Fri, 2011-08-12 at 10:43 +0800, Wu Fengguang wrote: > > > s - x 3 > > > f(x) := 1 + (-----) > > > l - s > > > > > Looks very neat, much simpler than the three curves solution! > > Glad you like it, there is of course the small matter of real-world > behaviour to consider, lets hope that works as well :-) It magically meets all the criteria in my mind, not to mention it can eliminate 2 extra patches. As for the tests, so far, so good :) Your arithmetic is awesome! Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-08 23:05 ` Wu Fengguang (?) @ 2011-08-12 12:54 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-12 12:54 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Fri, 2011-08-12 at 00:56 +0200, Peter Zijlstra wrote: > > s - x 3 > f(x) := 1 + (-----) > d > btw, if you want steeper slopes for rampup and brake you can add another factor like: s - x 3 f(x) := 1 + a(-----) d And solve the whole f(l)=0 thing again to determine d in l and a. For 0 < a < 1 the slopes increase. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 12:54 ` Peter Zijlstra
@ 2011-08-12 12:59 ` Wu Fengguang
  -1 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-12 12:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
    Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
    linux-mm, LKML

On Fri, Aug 12, 2011 at 08:54:17PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 00:56 +0200, Peter Zijlstra wrote:
> >
> >             s - x  3
> > f(x) := 1 + (-----)
> >               d
> >
> btw, if you want steeper slopes for rampup and brake you can add another
> factor like:
>
>             s - x  3
> f(x) := 1 + a(-----)
>               d
>
> And solve the whole f(l)=0 thing again to determine d in l and a.
>
> For 0 < a < 1 the slopes increase.

Yes, we can leave it as a future tuning option. For now I'm pretty
satisfied with the current function's shape :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 12:59 ` Wu Fengguang
  (?)
@ 2011-08-12 13:08 ` Peter Zijlstra
  -1 siblings, 0 replies; 301+ messages in thread
From: Peter Zijlstra @ 2011-08-12 13:08 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
    Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
    linux-mm, LKML

On Fri, 2011-08-12 at 20:59 +0800, Wu Fengguang wrote:
> On Fri, Aug 12, 2011 at 08:54:17PM +0800, Peter Zijlstra wrote:
> > On Fri, 2011-08-12 at 00:56 +0200, Peter Zijlstra wrote:
> > >
> > >             s - x  3
> > > f(x) := 1 + (-----)
> > >               d
> > >
> > btw, if you want steeper slopes for rampup and brake you can add another
> > factor like:
> >
> >             s - x  3
> > f(x) := 1 + a(-----)
> >               d
> >
> > And solve the whole f(l)=0 thing again to determine d in l and a.
> >
> > For 0 < a < 1 the slopes increase.
>
> Yes, we can leave it as a future tuning option. For now I'm pretty
> satisfied with the current function's shape :)

Oh for sure, it just occurred to me when looking at your plots and I
thought I'd at least mention it... You know, something to poke at on a
rainy afternoon ;-)

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05 ` Wu Fengguang
  (?)
@ 2011-08-12 13:04 ` Peter Zijlstra
  -1 siblings, 0 replies; 301+ messages in thread
From: Peter Zijlstra @ 2011-08-12 13:04 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
    Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
    linux-mm, LKML

On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
>
> Now all of the above would seem to suggest:
>
>   dirty_ratelimit := ref_bw
>
> However for that you use:
>
>   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
>           dirty_ratelimit = max(ref_bw, pos_bw);
>
>   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
>           dirty_ratelimit = min(ref_bw, pos_bw);
>
> You have:
>
>   pos_bw = dirty_ratelimit * pos_ratio
>
> Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> why are you ignoring the shift in output vs input rate there?

Could you elaborate on this primary feedback loop? It's the one part I
don't feel I actually understand well.

^ permalink raw reply	[flat|nested] 301+ messages in thread
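The quantities quoted here can be related in a toy computation of the two estimates, with a UNIT-scaled pos_ratio. The formulas follow the quoted patch; the wrapper function and the UNIT value are illustrative:

```c
#define UNIT 1024UL

/*
 * pos_bw = dirty_ratelimit * pos_ratio (position feedback);
 * ref_bw additionally scales by the writeout vs dirtying bandwidth
 * ratio write_bw / dirty_bw (rate feedback).
 */
static void estimate_rates(unsigned long dirty_ratelimit,
			   unsigned long pos_ratio,	/* UNIT-scaled */
			   unsigned long write_bw,
			   unsigned long dirty_bw,
			   unsigned long *pos_bw,
			   unsigned long *ref_bw)
{
	*pos_bw = dirty_ratelimit * pos_ratio / UNIT;
	*ref_bw = *pos_bw * write_bw / dirty_bw;
}
```

With pos_ratio == UNIT and write_bw == dirty_bw, both estimates equal dirty_ratelimit and the system is balanced.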
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 13:04 ` Peter Zijlstra
@ 2011-08-12 14:20 ` Wu Fengguang
  -1 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-12 14:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
    Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
    linux-mm, LKML

Peter,

Sorry for the delay..

On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:

To start with,

                                            write_bw
    ref_bw = task_ratelimit_in_past_200ms * --------
                                            dirty_bw

where
    task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio

> > Now all of the above would seem to suggest:
> >
> >   dirty_ratelimit := ref_bw

Right, ideally ref_bw is the balanced dirty ratelimit. I actually
started with exactly the above equation when I got choked by pure
pos_bw based feedback control (as mentioned in the reply to Jan's
email) and introduced the ref_bw estimation as the way out.

But there are some imperfections in ref_bw, too, which make it
unsuitable for direct use:

1) large fluctuations

   The dirty_bw used for computing ref_bw is merely averaged over the
   past 200ms (very small compared to the 3s estimation period of
   write_bw), which makes for a rather dispersed distribution of
   ref_bw.

   http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/8G/ext4-10dd-4k-32p-6802M-20:10-3.0.0-next-20110802+-2011-08-06.16:48/balance_dirty_pages-pages.png

   Take a look at the blue [*] points in the above graph. I find it
   pretty hard to average out the singular points by increasing the
   estimation period. Considering that the averaging technique would
   introduce very undesirable time lags, I gave it up totally. (btw,
   the write_bw averaging time lag is much more acceptable because its
   impact is one-way and therefore won't lead to oscillations.)

   The one practical way is filtering -- most of the large singular
   ref_bw points can be filtered out effectively by remembering some
   prev_ref_bw and prev_prev_ref_bw. However, it cannot do away with
   all of them, and the remaining majority of ref_bw points are still
   randomly dancing around the ideal balanced rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_bw) pair
   becomes an unbalanced match, which leads to large systematic errors
   in ref_bw. The truncates, due to their possibly bumpy nature, can
   hardly be compensated for smoothly.

   So let's face it. When some over-estimated ref_bw brings
   ->dirty_ratelimit high, higher than the setpoint, the pos_bw will
   in turn become lower than ->dirty_ratelimit. So if we consider both
   ref_bw and pos_bw and update ->dirty_ratelimit only when they are
   on the same side of ->dirty_ratelimit, the systematic errors in
   ref_bw won't be able to drag ->dirty_ratelimit too far away.

   The ref_bw estimation is also not accurate near the max-pause and
   free-run areas.

3) since we ultimately want to

   - keep the dirty pages around the setpoint for as long as possible
   - keep the fluctuations of the task ratelimit as small as possible

   the update policy used for (2) also serves the above goals nicely:
   if for some reason the dirty pages are high (pos_bw <
   dirty_ratelimit), and dirty_ratelimit is low (dirty_ratelimit <
   ref_bw), there is no point in bringing up dirty_ratelimit in a
   hurry, which would hurt both of the above goals.

> > However for that you use:
> >
> >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> >           dirty_ratelimit = max(ref_bw, pos_bw);
> >
> >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> >           dirty_ratelimit = min(ref_bw, pos_bw);

The above are merely constraints on the dirty_ratelimit update.
They serve to

1) stop adjusting the rate when it's against the position control
   target (the adjusted rate will slow down the progress of dirty
   pages going back to the setpoint).

2) limit the step size. pos_bw changes step by step, leaving a
   consistent trace, compared to the randomly jumping ref_bw. pos_bw
   also has smaller errors in the stable state and normally has larger
   errors when there are big errors in rate. So it's a pretty good
   limiting factor for the step size of dirty_ratelimit.

> > You have:
> >
> >   pos_bw = dirty_ratelimit * pos_ratio
> >
> > Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> > why are you ignoring the shift in output vs input rate there?

Again, you need to understand pos_bw the other way around. Only
(pos_bw - dirty_ratelimit) matters here, which is exactly the position
error.

> Could you elaborate on this primary feedback loop? It's the one part I
> don't feel I actually understand well.

Hope the above elaboration helps :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
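The two constraints described here condense into a tiny standalone function. The conditional bodies follow the quoted patch; the wrapper, its name, and the types are illustrative:

```c
/*
 * Move dirty_ratelimit only when pos_bw and ref_bw agree that it is
 * too high or too low, and then only up to the nearer of the two
 * estimates -- max() when stepping down, min() when stepping up.
 */
static unsigned long update_ratelimit(unsigned long dirty_ratelimit,
				      unsigned long pos_bw,
				      unsigned long ref_bw)
{
	if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
		dirty_ratelimit = pos_bw > ref_bw ? pos_bw : ref_bw;

	if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
		dirty_ratelimit = pos_bw < ref_bw ? pos_bw : ref_bw;

	return dirty_ratelimit;
}
```

When the two estimates disagree about the direction, the rate is left untouched, which is exactly the "same side of ->dirty_ratelimit" rule described above.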
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 14:20 ` Wu Fengguang
  (?)
@ 2011-08-22 15:38 ` Peter Zijlstra
  -1 siblings, 0 replies; 301+ messages in thread
From: Peter Zijlstra @ 2011-08-22 15:38 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
    Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
    linux-mm, LKML

On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote:
> On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
>
> To start with,
>
>                                             write_bw
>     ref_bw = task_ratelimit_in_past_200ms * --------
>                                             dirty_bw
>
> where
>     task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio
>
> > > Now all of the above would seem to suggest:
> > >
> > >   dirty_ratelimit := ref_bw
>
> Right, ideally ref_bw is the balanced dirty ratelimit. I actually
> started with exactly the above equation when I got choked by pure
> pos_bw based feedback control (as mentioned in the reply to Jan's
> email) and introduced the ref_bw estimation as the way out.
>
> But there are some imperfections in ref_bw, too, which make it
> unsuitable for direct use:
>
> 1) large fluctuations

OK, understood.

> 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw) pair
>    becomes an unbalanced match, which leads to large systematic errors
>    in ref_bw. The truncates, due to their possibly bumpy nature, can
>    hardly be compensated for smoothly.

OK.

> 3) since we ultimately want to
>
>    - keep the dirty pages around the setpoint for as long as possible
>    - keep the fluctuations of the task ratelimit as small as possible

Fair enough ;-)

> the update policy used for (2) also serves the above goals nicely:
> if for some reason the dirty pages are high (pos_bw < dirty_ratelimit),
> and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
> point in bringing up dirty_ratelimit in a hurry, which would hurt both
> of the above goals.

Right, so still I feel somewhat befuddled, so we have:

  dirty_ratelimit - rate at which we throttle dirtiers as estimated
                    up to 200ms ago.

  pos_ratio       - ratio adjusting the dirty_ratelimit for variance
                    in dirty pages around its target

  bw_ratio        - ratio adjusting the dirty_ratelimit for variance
                    in input/output bandwidth

and we need to basically do:

  dirty_ratelimit *= pos_ratio * bw_ratio

to update the dirty_ratelimit to reflect the current state. However,
per 1) and 2) bw_ratio is crappy and hard to fix. So you propose to
update dirty_ratelimit only if both pos_ratio and bw_ratio point in
the same direction, however that would result in:

  if (pos_ratio < UNIT && bw_ratio < UNIT ||
      pos_ratio > UNIT && bw_ratio > UNIT) {
          dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT;
          dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT;
  }

> > > However for that you use:
> > >
> > >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> > >           dirty_ratelimit = max(ref_bw, pos_bw);
> > >
> > >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> > >           dirty_ratelimit = min(ref_bw, pos_bw);
>
> The above are merely constraints on the dirty_ratelimit update.
> They serve to
>
> 1) stop adjusting the rate when it's against the position control
>    target (the adjusted rate will slow down the progress of dirty
>    pages going back to the setpoint).

Not strictly speaking: suppose pos_ratio = 0.5 and bw_ratio = 1.1,
then they point in different directions; however,

  0.5 < 1 && 0.5 * 1.1 < 1

so your code will in fact update the dirty_ratelimit, even though the
two factors point in opposite directions.

> 2) limit the step size. pos_bw changes step by step, leaving a
>    consistent trace, compared to the randomly jumping ref_bw. pos_bw
>    also has smaller errors in the stable state and normally has larger
>    errors when there are big errors in rate. So it's a pretty good
>    limiting factor for the step size of dirty_ratelimit.

OK, so that's the min/max stuff; however, it only works because you use
pos_bw and ref_bw instead of the fully separated factors.

> Hope the above elaboration helps :)

A little..

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control @ 2011-08-22 15:38 ` Peter Zijlstra 0 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-22 15:38 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote: > On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote: > > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote: > > To start with, > > write_bw > ref_bw = task_ratelimit_in_past_200ms * -------- > dirty_bw > > where > task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio > > > > Now all of the above would seem to suggest: > > > > > > dirty_ratelimit := ref_bw > > Right, ideally ref_bw is the balanced dirty ratelimit. I actually > started with exactly the above equation when I got choked by pure > pos_bw based feedback control (as mentioned in the reply to Jan's > email) and introduced the ref_bw estimation as the way out. > > But there are some imperfections in ref_bw, too. Which makes it not > suitable for direct use: > > 1) large fluctuations OK, understood. > 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw) > becomes unbalanced match, which leads to large systematical errors > in ref_bw. The truncates, due to its possibly bumpy nature, can hardly > be compensated smoothly. OK. > 3) since we ultimately want to > > - keep the dirty pages around the setpoint as long time as possible > - keep the fluctuations of task ratelimit as small as possible Fair enough ;-) > the update policy used for (2) also serves the above goals nicely: > if for some reason the dirty pages are high (pos_bw < dirty_ratelimit), > and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no > point to bring up dirty_ratelimit in a hurry and to hurt both the > above two goals. 
Right, so still I feel somewhat befuddled, so we have: dirty_ratelimit - rate at which we throttle dirtiers as estimated upto 200ms ago. pos_ratio - ratio adjusting the dirty_ratelimit for variance in dirty pages around its target bw_ratio - ratio adjusting the dirty_ratelimit for variance in input/output bandwidth and we need to basically do: dirty_ratelimit *= pos_ratio * bw_ratio to update the dirty_ratelimit to reflect the current state. However per 1) and 2) bw_ratio is crappy and hard to fix. So you propose to update dirty_ratelimit only if both pos_ratio and bw_ratio point in the same direction, however that would result in: if (pos_ratio < UNIT && bw_ratio < UNIT || pos_ratio > UNIT && bw_ratio > UNIT) { dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT; dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT; } > > > However for that you use: > > > > > > if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit) > > > dirty_ratelimit = max(ref_bw, pos_bw); > > > > > > if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit) > > > dirty_ratelimit = min(ref_bw, pos_bw); > > The above are merely constraints to the dirty_ratelimit update. > It serves to > > 1) stop adjusting the rate when it's against the position control > target (the adjusted rate will slow down the progress of dirty > pages going back to setpoint). Not strictly speaking, suppose pos_ratio = 0.5 and bw_ratio = 1.1, then they point in different directions however: 0.5 < 1 && 0.5 * 1.1 < 1 so your code will in fact update the dirty_ratelimit, even though the two factors point in opposite directions. > 2) limit the step size. pos_bw is changing values step by step, > leaving a consistent trace comparing to the randomly jumping > ref_bw. pos_bw also has smaller errors in stable state and normally > have larger errors when there are big errors in rate. So it's a > pretty good limiting factor for the step size of dirty_ratelimit. 
OK, so that's the min/max stuff, however it only works because you use pos_bw and ref_bw instead of the fully separated factors. > Hope the above elaboration helps :) A little.. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control @ 2011-08-22 15:38 ` Peter Zijlstra 0 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-22 15:38 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote: > On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote: > > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote: > > To start with, > > write_bw > ref_bw = task_ratelimit_in_past_200ms * -------- > dirty_bw > > where > task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio > > > > Now all of the above would seem to suggest: > > > > > > dirty_ratelimit := ref_bw > > Right, ideally ref_bw is the balanced dirty ratelimit. I actually > started with exactly the above equation when I got choked by pure > pos_bw based feedback control (as mentioned in the reply to Jan's > email) and introduced the ref_bw estimation as the way out. > > But there are some imperfections in ref_bw, too. Which makes it not > suitable for direct use: > > 1) large fluctuations OK, understood. > 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw) > becomes unbalanced match, which leads to large systematical errors > in ref_bw. The truncates, due to its possibly bumpy nature, can hardly > be compensated smoothly. OK. > 3) since we ultimately want to > > - keep the dirty pages around the setpoint as long time as possible > - keep the fluctuations of task ratelimit as small as possible Fair enough ;-) > the update policy used for (2) also serves the above goals nicely: > if for some reason the dirty pages are high (pos_bw < dirty_ratelimit), > and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no > point to bring up dirty_ratelimit in a hurry and to hurt both the > above two goals. 
Right, so still I feel somewhat befuddled, so we have: dirty_ratelimit - rate at which we throttle dirtiers as estimated upto 200ms ago. pos_ratio - ratio adjusting the dirty_ratelimit for variance in dirty pages around its target bw_ratio - ratio adjusting the dirty_ratelimit for variance in input/output bandwidth and we need to basically do: dirty_ratelimit *= pos_ratio * bw_ratio to update the dirty_ratelimit to reflect the current state. However per 1) and 2) bw_ratio is crappy and hard to fix. So you propose to update dirty_ratelimit only if both pos_ratio and bw_ratio point in the same direction, however that would result in: if (pos_ratio < UNIT && bw_ratio < UNIT || pos_ratio > UNIT && bw_ratio > UNIT) { dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT; dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT; } > > > However for that you use: > > > > > > if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit) > > > dirty_ratelimit = max(ref_bw, pos_bw); > > > > > > if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit) > > > dirty_ratelimit = min(ref_bw, pos_bw); > > The above are merely constraints to the dirty_ratelimit update. > It serves to > > 1) stop adjusting the rate when it's against the position control > target (the adjusted rate will slow down the progress of dirty > pages going back to setpoint). Not strictly speaking, suppose pos_ratio = 0.5 and bw_ratio = 1.1, then they point in different directions however: 0.5 < 1 && 0.5 * 1.1 < 1 so your code will in fact update the dirty_ratelimit, even though the two factors point in opposite directions. > 2) limit the step size. pos_bw is changing values step by step, > leaving a consistent trace comparing to the randomly jumping > ref_bw. pos_bw also has smaller errors in stable state and normally > have larger errors when there are big errors in rate. So it's a > pretty good limiting factor for the step size of dirty_ratelimit. 
OK, so that's the min/max stuff, however it only works because you use pos_bw and ref_bw instead of the fully separated factors. > Hope the above elaboration helps :) A little.. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-22 15:38 ` Peter Zijlstra @ 2011-08-23 3:40 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-23 3:40 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Mon, Aug 22, 2011 at 11:38:07PM +0800, Peter Zijlstra wrote: > On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote: > > On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote: > > > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote: > > > > To start with, > > > > write_bw > > ref_bw = task_ratelimit_in_past_200ms * -------- > > dirty_bw > > > > where > > task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio > > > > > > Now all of the above would seem to suggest: > > > > > > > > dirty_ratelimit := ref_bw > > > > Right, ideally ref_bw is the balanced dirty ratelimit. I actually > > started with exactly the above equation when I got choked by pure > > pos_bw based feedback control (as mentioned in the reply to Jan's > > email) and introduced the ref_bw estimation as the way out. > > > > But there are some imperfections in ref_bw, too. Which makes it not > > suitable for direct use: > > > > 1) large fluctuations > > OK, understood. > > > 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw) > > becomes unbalanced match, which leads to large systematical errors > > in ref_bw. The truncates, due to its possibly bumpy nature, can hardly > > be compensated smoothly. > > OK. 
> > > 3) since we ultimately want to > > > > - keep the dirty pages around the setpoint as long time as possible > > - keep the fluctuations of task ratelimit as small as possible > > Fair enough ;-) > > > the update policy used for (2) also serves the above goals nicely: > > if for some reason the dirty pages are high (pos_bw < dirty_ratelimit), > > and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no > > point to bring up dirty_ratelimit in a hurry and to hurt both the > > above two goals. > > Right, so still I feel somewhat befuddled, so we have: > > dirty_ratelimit - rate at which we throttle dirtiers as > estimated upto 200ms ago. Note that bdi->dirty_ratelimit is supposed to be the balanced ratelimit, ie. (write_bw / N), regardless of whether the dirty pages meet the setpoint. In _concept_, the bdi balanced ratelimit is updated _independently_ of the position control embodied in the task ratelimit calculation. A lot of confusion seems to come from the seemingly intertwined rate and position controls; however, in my mind, there are two levels of relationship: 1) the two work fundamentally independently of each other, each trying to fulfill one single target (either the balanced rate or the balanced position) 2) _based_ on (1), and completely optional, we try to constrain the rate update to get a more stable ->dirty_ratelimit and a more balanced dirty position Note that (2) is not a must even if there are systematic errors in the balanced_rate calculation. 
For example, the v8 patchset only does (1) and hence does the simple bdi->dirty_ratelimit = balanced_rate; And it can still balance at some point (though not exactly around the setpoint): http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G-bs=1M/ext4-1dd-1M-8p-2942M-20:10-3.0.0-next-20110802+-2011-08-08.19:47/balance_dirty_pages-pages.png It balances even though ext4 has a mismatched (dirty_rate:write_bw ~= 3:2) ratio, which introduces systematic errors into balanced_rate: http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G-bs=1M/ext4-1dd-1M-8p-2942M-20:10-3.0.0-next-20110802+-2011-08-08.19:47/global_dirtied_written.png > pos_ratio - ratio adjusting the dirty_ratelimit > for variance in dirty pages around its target So pos_ratio - is a _limiting_ factor rather than an _adjusting_ factor for updating ->dirty_ratelimit (when doing (2)) - is not a factor at all for updating balanced_rate (whether or not we do (2)) well, in this concept: the balanced_rate formula inherently does not derive balanced_rate_(i+1) from balanced_rate_i. Rather, it's based on the ratelimit executed for the past 200ms: balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio and task_ratelimit_200ms happens to be estimable from task_ratelimit_200ms ~= balanced_rate_i * pos_ratio There is fundamentally no dependency between balanced_rate_(i+1) and balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation only asks for _whatever_ CONSTANT task ratelimit to be executed for 200ms, and then gets the balanced rate from the dirty_rate feedback. We may alternatively record every task_ratelimit executed in the past 200ms and average them all to get task_ratelimit_200ms. 
In this way we take the "superfluous" pos_ratio out of sight :) > bw_ratio - ratio adjusting the dirty_ratelimit > for variance in input/output bandwidth > > and we need to basically do: > > dirty_ratelimit *= pos_ratio * bw_ratio So there is not even any such recursion at all: balanced_rate *= bw_ratio Each balanced_rate is estimated from scratch, based on each 200ms period. > to update the dirty_ratelimit to reflect the current state. However per > 1) and 2) bw_ratio is crappy and hard to fix. > > So you propose to update dirty_ratelimit only if both pos_ratio and > bw_ratio point in the same direction, however that would result in: > > if (pos_ratio < UNIT && bw_ratio < UNIT || > pos_ratio > UNIT && bw_ratio > UNIT) { > dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT; > dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT; > } We start by doing this for (1): dirty_ratelimit = balanced_rate and then try to refine it for (1)+(2): dirty_ratelimit => balanced_rate, but limit the progress by pos_ratio > > > > However for that you use: > > > > > > if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit) > > > > dirty_ratelimit = max(ref_bw, pos_bw); > > > > > > > > if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit) > > > > dirty_ratelimit = min(ref_bw, pos_bw); > > > > The above are merely constraints to the dirty_ratelimit update. > > It serves to > > > > 1) stop adjusting the rate when it's against the position control > > target (the adjusted rate will slow down the progress of dirty > > pages going back to setpoint). > > Not strictly speaking, suppose pos_ratio = 0.5 and bw_ratio = 1.1, then > they point in different directions however: > > 0.5 < 1 && 0.5 * 1.1 < 1 > > so your code will in fact update the dirty_ratelimit, even though the > two factors point in opposite directions. It does not work that way since pos_ratio does not take part in the multiplication. 
However I admit that the tests (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit) (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit) don't aim to avoid all unnecessary updates, and they may even stop some rightful updates. It's not possible to act perfectly. It's merely a rule that sounds "reasonable" in theory and works reasonably well in practice :) I'd be happy to try more if there are better ones. > > 2) limit the step size. pos_bw is changing values step by step, > > leaving a consistent trace comparing to the randomly jumping > > ref_bw. pos_bw also has smaller errors in stable state and normally > > have larger errors when there are big errors in rate. So it's a > > pretty good limiting factor for the step size of dirty_ratelimit. > > OK, so that's the min/max stuff, however it only works because you use > pos_bw and ref_bw instead of the fully separated factors. Yes, the min/max stuff is for limiting the step size. The "limiting" intention can be made clearer if written as delta = balanced_rate - base_rate; if (delta > pos_rate - base_rate) delta = pos_rate - base_rate; delta /= 8; > > Hope the above elaboration helps :) > > A little.. And now? ;) Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-23 3:40 ` Wu Fengguang (?) @ 2011-08-23 10:01 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-23 10:01 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote: > - not a factor at all for updating balanced_rate (whether or not we do (2)) > well, in this concept: the balanced_rate formula inherently does not > derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's > based on the ratelimit executed for the past 200ms: > > balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio Ok, this is where it all goes funny.. So if you want completely separated feedback loops I would expect something like: balance_rate_(i+1) = balance_rate_(i) * bw_ratio ; every 200ms The former is a complete feedback loop, expressing the new value in the old value (*) with bw_ratio as feedback parameter; if we throttled too much, the dirty_rate will have dropped and the bw_ratio will be >1, causing the balance_rate to rise, which increases the dirty_rate, and vice versa. (*) which is the form I expected and why I thought your primary feedback loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio With the above balance_rate is an independent variable that tracks the write bandwidth. Now possibly you'd want a low-pass filter on that since your bw_ratio is a bit funny in the head, but that's another story. 
Then when you use the balance_rate to actually throttle tasks you apply your secondary control steering the dirty page count, yielding: task_rate = balance_rate * pos_ratio > and task_ratelimit_200ms happen to can be estimated from > > task_ratelimit_200ms ~= balanced_rate_i * pos_ratio > We may alternatively record every task_ratelimit executed in the > past 200ms and average them all to get task_ratelimit_200ms. In this > way we take the "superfluous" pos_ratio out of sight :) Right, so I'm not at all sure that makes sense, it's not immediately evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it clear to me why your primary feedback loop uses task_ratelimit_200ms at all. > There is fundamentally no dependency between balanced_rate_(i+1) and > balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation > only asks for _whatever_ CONSTANT task ratelimit to be executed for > 200ms, then it get the balanced rate from the dirty_rate feedback. How can there not be a relation between balance_rate_(i+1) and balance_rate_(i) ? ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-23 10:01 ` Peter Zijlstra @ 2011-08-23 14:15 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-23 14:15 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote: > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote: > > - not a factor at all for updating balanced_rate (whether or not we do (2)) > > well, in this concept: the balanced_rate formula inherently does not > > derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's > > based on the ratelimit executed for the past 200ms: > > > > balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio > > Ok, this is where it all goes funny.. > > So if you want completely separated feedback loops I would expect If we call them feedback loops, then it's a series of independent feedback loops of depth 1, because each balanced_rate is a fresh estimation dependent solely on - the writeout bandwidth - N, the number of dd tasks in the past 200ms. As long as a CONSTANT ratelimit (whatever value it is) was executed in the past 200ms, we get the same balanced_rate. balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate The resulting balanced_rate is independent of how large the CONSTANT ratelimit is, because if we start with a doubled CONSTANT ratelimit, we'll see a doubled dirty_rate and arrive at the same balanced_rate. In that manner, balance_rate_(i+1) does not really depend on the value of balance_rate_(i): whatever balance_rate_(i) is, we are going to get the same balance_rate_(i+1), estimation errors aside. Note that the estimation errors mainly come from the fluctuations in dirty_rate. 
That may well be what's already in your mind, just that we disagree about the terms ;) > something like: > > balance_rate_(i+1) = balance_rate_(i) * bw_ratio ; every 200ms > > The former is a complete feedback loop, expressing the new value in the > old value (*) with bw_ratio as feedback parameter; if we throttled too > much, the dirty_rate will have dropped and the bw_ratio will be <1 > causing the balance_rate to drop increasing the dirty_rate, and vice > versa. In principle, the bw_ratio works that way. However, balance_rate_(i) is not the exact _executed_ ratelimit in balance_dirty_pages(). > (*) which is the form I expected and why I thought your primary feedback > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio Because the executed ratelimit was rate_(i) * pos_ratio. > With the above balance_rate is an independent variable that tracks the > write bandwidth. Now possibly you'd want a low-pass filter on that since > your bw_ratio is a bit funny in the head, but that's another story. Yeah. > Then when you use the balance_rate to actually throttle tasks you apply > your secondary control steering the dirty page count, yielding: > > task_rate = balance_rate * pos_ratio Right. Note the above formula is not a derived one, but an original one that later leads to pos_ratio showing up in the calculation of balanced_rate. > > and task_ratelimit_200ms happen to can be estimated from > > > > task_ratelimit_200ms ~= balanced_rate_i * pos_ratio > > > We may alternatively record every task_ratelimit executed in the > > past 200ms and average them all to get task_ratelimit_200ms. In this > > way we take the "superfluous" pos_ratio out of sight :) > > Right, so I'm not at all sure that makes sense, its not immediately > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it > clear to me why your primary feedback loop uses task_ratelimit_200ms at > all. 
task_ratelimit is used and hence defined to be (balance_rate * pos_ratio) by balance_dirty_pages(). So this is an original formula: task_ratelimit = balance_rate * pos_ratio task_ratelimit_200ms is also used as an original data source in balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate Then we try to estimate task_ratelimit_200ms by assuming all tasks have been executing the same CONSTANT ratelimit in balance_dirty_pages(). Hence we get task_ratelimit_200ms ~= prev_balance_rate * pos_ratio > > There is fundamentally no dependency between balanced_rate_(i+1) and > > balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation > > only asks for _whatever_ CONSTANT task ratelimit to be executed for > > 200ms, then it get the balanced rate from the dirty_rate feedback. > > How can there not be a relation between balance_rate_(i+1) and > balance_rate_(i) ? In this manner: even though balance_rate_(i) is somehow used for calculating balance_rate_(i+1), the latter will evaluate to the same value given whatever balance_rate_(i). That is, there are two kinds of dependency: the seeming dependency in the formula, and the effective dependency in the data values. Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-23 14:15 ` Wu Fengguang @ 2011-08-23 17:47 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-23 17:47 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Tue, Aug 23, 2011 at 10:15:04PM +0800, Wu Fengguang wrote: > On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote: > > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote: > > > - not a factor at all for updating balanced_rate (whether or not we do (2)) > > > well, in this concept: the balanced_rate formula inherently does not > > > derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's > > > based on the ratelimit executed for the past 200ms: > > > > > > balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio > > > > Ok, this is where it all goes funny.. > > > > So if you want completely separated feedback loops I would expect > > If call it feedback loops, then it's a series of independent feedback > loops of depth 1. Because each balanced_rate is a fresh estimation > dependent solely on > > - writeout bandwidth > - N, the number of dd tasks > > in the past 200ms. > > As long as a CONSTANT ratelimit (whatever value it is) is executed in > the past 200ms, we can get the same balanced_rate. > > balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate > > The resulted balanced_rate is independent of how large the CONSTANT > ratelimit is, because if we start with a doubled CONSTANT ratelimit, > we'll see doubled dirty_rate and result in the same balanced_rate. > > In that manner, balance_rate_(i+1) is not really depending on the > value of balance_rate_(i): whatever balance_rate_(i) is, we are going > to get the same balance_rate_(i+1) if not considering estimation > errors. Note that the estimation errors mainly come from the > fluctuations in dirty_rate. 
> > That may well be what's already in your mind, just that we disagree > about the terms ;) > > > something like: > > > > balance_rate_(i+1) = balance_rate_(i) * bw_ratio ; every 200ms > > > > The former is a complete feedback loop, expressing the new value in the > > old value (*) with bw_ratio as feedback parameter; if we throttled too > > much, the dirty_rate will have dropped and the bw_ratio will be <1 > > causing the balance_rate to drop increasing the dirty_rate, and vice > > versa. > > In principle, the bw_ratio works that way. However since > balance_rate_(i) is not the exact _executed_ ratelimit in > balance_dirty_pages(). > > > (*) which is the form I expected and why I thought your primary feedback > > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio > > Because the executed ratelimit was rate_(i) * pos_ratio. > > > With the above balance_rate is an independent variable that tracks the > > write bandwidth. Now possibly you'd want a low-pass filter on that since > > your bw_ratio is a bit funny in the head, but that's another story. > > Yeah. > > > Then when you use the balance_rate to actually throttle tasks you apply > > your secondary control steering the dirty page count, yielding: > > > > task_rate = balance_rate * pos_ratio > > Right. Note the above formula is not a derived one, but an original > one that later leads to pos_ratio showing up in the calculation of > balanced_rate. > > > > and task_ratelimit_200ms happen to can be estimated from > > > > > > task_ratelimit_200ms ~= balanced_rate_i * pos_ratio > > > > > We may alternatively record every task_ratelimit executed in the > > > past 200ms and average them all to get task_ratelimit_200ms. In this > > > way we take the "superfluous" pos_ratio out of sight :) > > > > Right, so I'm not at all sure that makes sense, its not immediately > > evident that <task_ratelimit> ~= balance_rate * pos_ratio. 
> > Nor is it
> > clear to me why your primary feedback loop uses task_ratelimit_200ms at
> > all.
>
> task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
> by balance_dirty_pages(). So this is an original formula:
>
>     task_ratelimit = balance_rate * pos_ratio
>
> task_ratelimit_200ms is also used as an original data source in
>
>     balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

I think the above calculates to

    task_ratelimit = balanced_rate * pos_ratio

or

    task_ratelimit = task_ratelimit_200ms * write_bw / dirty_rate * pos_ratio

or

    task_ratelimit = balance_rate * pos_ratio * write_bw / dirty_rate * pos_ratio

or

    task_ratelimit = balance_rate * write_bw / dirty_rate * (pos_ratio)^2

And the question is why not

    task_ratelimit = prev_balance_rate * write_bw / dirty_rate * pos_ratio

which sounds intuitive compared to the former one.

You somehow directly jump to

    balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

without explaining why the following will not work:

    balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate

Thanks
Vivek

^ permalink raw reply	[flat|nested] 301+ messages in thread
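The substitution chain above is plain algebra, which a few lines of Python can check numerically (a toy check with made-up values; powers of two are used so floating-point arithmetic is exact):

```python
# Plug balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate and
# task_ratelimit_200ms = balance_rate * pos_ratio into
#     task_ratelimit = balanced_rate * pos_ratio
# and compare against the (pos_ratio)^2 form derived above.
balance_rate, pos_ratio, write_bw, dirty_rate = 32.0, 0.5, 128.0, 64.0

task_ratelimit_200ms = balance_rate * pos_ratio
balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
lhs = balanced_rate * pos_ratio
rhs = balance_rate * write_bw / dirty_rate * pos_ratio ** 2
print(lhs, rhs)  # both 16.0
```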
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 17:47         ` Vivek Goyal
@ 2011-08-24  0:12           ` Wu Fengguang
  -1 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-24  0:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

> You somehow directly jump to
>
>     balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
>
> without explaining why following will not work.
>
>     balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate

Thanks for asking that, it's probably the root of the confusion, so let
me answer it standalone.

It's actually pretty simple to explain this equation:

                                           write_bw
    balanced_rate = task_ratelimit_200ms * ----------              (1)
                                           dirty_rate

If there are N dd tasks, each throttled at task_ratelimit_200ms for the
past 200ms, we are going to measure the overall bdi dirty rate

    dirty_rate = N * task_ratelimit_200ms                          (2)

Put (2) into (1) and we get

    balanced_rate = write_bw / N                                   (3)

So equation (1) is the right estimation to reach the desired target (3).

As for

                                              write_bw
    balanced_rate_(i+1) = balanced_rate_(i) * ----------           (4)
                                              dirty_rate

let's compare it with the "expanded" form of (1):

                                                          write_bw
    balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------   (5)
                                                          dirty_rate

So the difference lies in pos_ratio.

Believe it or not, it's exactly the seemingly superfluous use of
pos_ratio that makes (5) independent(*) of the position control.

Why? Look at (4) and assume the system is in a state where

- the dirty rate is already balanced, ie.

      balanced_rate_(i) = write_bw / N

- the dirty position is not balanced, for example pos_ratio = 0.5

balance_dirty_pages() will then be rate limiting each task at half the
balanced dirty rate, yielding a measured

    dirty_rate = write_bw / 2                                      (6)

Put (6) into (4), and we get

    balanced_rate_(i+1) = balanced_rate_(i) * 2
                        = (write_bw / N) * 2

That means any position imbalance will lead to balanced_rate estimation
errors if we follow (4). Whereas if (1)/(5) is used, we always get the
right balanced dirty ratelimit value whether or not (pos_ratio == 1.0),
which makes the rate estimation independent(*) of the dirty position
control.

(*) independent as in the real values, not the apparent relations in
the equations

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
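The divergence between the two update rules can be reproduced in a few lines (a toy sketch, not the kernel code; N, write_bw and the fixed pos_ratio are invented values):

```python
# N tasks each execute task_ratelimit = balance_rate * pos_ratio, so the
# measured dirty_rate over the window is N * balance_rate * pos_ratio.
# Compare one update step of formula (5), which the patch effectively
# uses, against the naive formula (4).
write_bw = 100.0   # pages/s (made-up)
N = 4              # dd tasks
pos_ratio = 0.5    # dirty count above setpoint: tasks run at half speed

def step_5(rate):
    # balanced_rate_(i+1) = task_ratelimit_200ms * write_bw / dirty_rate
    task_ratelimit = rate * pos_ratio        # what tasks actually executed
    dirty_rate = N * task_ratelimit
    return task_ratelimit * write_bw / dirty_rate

def step_4(rate):
    # balanced_rate_(i+1) = balanced_rate_(i) * write_bw / dirty_rate
    dirty_rate = N * rate * pos_ratio        # same measured dirty_rate
    return rate * write_bw / dirty_rate

start = write_bw / N                 # already balanced: 25.0
print(step_5(start))                 # stays at write_bw / N == 25.0
print(step_4(start))                 # doubles to 50.0, as in (6)+(4)
```

With the position already imbalanced (pos_ratio = 0.5), rule (5) holds the estimate at write_bw / N while rule (4) doubles it, matching the (6)-into-(4) derivation above.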
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-24 0:12 ` Wu Fengguang @ 2011-08-24 16:12 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-24 16:12 UTC (permalink / raw) To: Wu Fengguang Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote: > > You somehow directly jump to > > > > balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate > > > > without explaining why following will not work. > > > > balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate > > Thanks for asking that, it's probably the root of confusions, so let > me answer it standalone. > > It's actually pretty simple to explain this equation: > > write_bw > balanced_rate = task_ratelimit_200ms * ---------- (1) > dirty_rate > > If there are N dd tasks, each task is throttled at task_ratelimit_200ms > for the past 200ms, we are going to measure the overall bdi dirty rate > > dirty_rate = N * task_ratelimit_200ms (2) > > put (2) into (1) we get > > balanced_rate = write_bw / N (3) > > So equation (1) is the right estimation to get the desired target (3). > > > As for > > write_bw > balanced_rate_(i+1) = balanced_rate_(i) * ---------- (4) > dirty_rate > > Let's compare it with the "expanded" form of (1): > > write_bw > balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ---------- (5) > dirty_rate > > So the difference lies in pos_ratio. > > Believe it or not, it's exactly the seemingly use of pos_ratio that > makes (5) independent(*) of the position control. > > Why? Look at (4), assume the system is in a state > > - dirty rate is already balanced, ie. 
>       balanced_rate_(i) = write_bw / N
>
> - dirty position is not balanced, for example pos_ratio = 0.5
>
> balance_dirty_pages() will be rate limiting each tasks at half the
> balanced dirty rate, yielding a measured
>
>       dirty_rate = write_bw / 2                               (6)
>
> Put (6) into (4), we get
>
>       balanced_rate_(i+1) = balanced_rate_(i) * 2
>                           = (write_bw / N) * 2
>
> That means, any position imbalance will lead to balanced_rate
> estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> always get the right balanced dirty ratelimit value whether or not
> (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> dirty position control.
>
> (*) independent as in real values, not the seemingly relations in equation

The assumption here is that N is a constant.. in the above case
pos_ratio would eventually end up at 1 and things would be good again. I
see your argument about oscillations, but I think you can introduce
similar effects by varying N.

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-24 16:12 ` Peter Zijlstra @ 2011-08-26 0:18 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-26 0:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Thu, Aug 25, 2011 at 12:12:58AM +0800, Peter Zijlstra wrote: > On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote: > > > You somehow directly jump to > > > > > > balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate > > > > > > without explaining why following will not work. > > > > > > balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate > > > > Thanks for asking that, it's probably the root of confusions, so let > > me answer it standalone. > > > > It's actually pretty simple to explain this equation: > > > > write_bw > > balanced_rate = task_ratelimit_200ms * ---------- (1) > > dirty_rate > > > > If there are N dd tasks, each task is throttled at task_ratelimit_200ms > > for the past 200ms, we are going to measure the overall bdi dirty rate > > > > dirty_rate = N * task_ratelimit_200ms (2) > > > > put (2) into (1) we get > > > > balanced_rate = write_bw / N (3) > > > > So equation (1) is the right estimation to get the desired target (3). > > > > > > As for > > > > write_bw > > balanced_rate_(i+1) = balanced_rate_(i) * ---------- (4) > > dirty_rate > > > > Let's compare it with the "expanded" form of (1): > > > > write_bw > > balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ---------- (5) > > dirty_rate > > > > So the difference lies in pos_ratio. > > > > Believe it or not, it's exactly the seemingly use of pos_ratio that > > makes (5) independent(*) of the position control. > > > > Why? Look at (4), assume the system is in a state > > > > - dirty rate is already balanced, ie. 
> >       balanced_rate_(i) = write_bw / N
> >
> > - dirty position is not balanced, for example pos_ratio = 0.5
> >
> > balance_dirty_pages() will be rate limiting each tasks at half the
> > balanced dirty rate, yielding a measured
> >
> >       dirty_rate = write_bw / 2                               (6)
> >
> > Put (6) into (4), we get
> >
> >       balanced_rate_(i+1) = balanced_rate_(i) * 2
> >                           = (write_bw / N) * 2
> >
> > That means, any position imbalance will lead to balanced_rate
> > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > always get the right balanced dirty ratelimit value whether or not
> > (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> > dirty position control.
> >
> > (*) independent as in real values, not the seemingly relations in equation
>
> The assumption here is that N is a constant.. in the above case
> pos_ratio would eventually end up at 1 and things would be good again. I
> see your argument about oscillations, but I think you can introduce
> similar effects by varying N.

Yeah, it's very possible for N to change over time, in which case
balanced_rate will adapt to the new N in a similar way.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
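The adaptation to a changing N follows directly from each estimate using only the current window's measurements, as a short sketch shows (toy numbers again, not kernel code):

```python
# Because the estimate depends only on the ratelimit executed in the
# current 200ms window and the dirty_rate measured in that window, a
# change in the number of dirtier tasks shows up in the very next
# balanced_rate.
write_bw = 100.0   # pages/s (made-up)

def next_rate(executed_ratelimit, n_tasks):
    dirty_rate = n_tasks * executed_ratelimit   # measured this window
    return executed_ratelimit * write_bw / dirty_rate

rate = next_rate(50.0, 2)    # 2 dd tasks -> write_bw / 2 == 50.0
rate = next_rate(rate, 4)    # two more tasks start -> write_bw / 4 == 25.0
print(rate)
```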
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-26 0:18 ` Wu Fengguang @ 2011-08-26 9:04 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-26 9:04 UTC (permalink / raw) To: Wu Fengguang Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, 2011-08-26 at 08:18 +0800, Wu Fengguang wrote: > On Thu, Aug 25, 2011 at 12:12:58AM +0800, Peter Zijlstra wrote: > > On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote: > > > Put (6) into (4), we get > > > > > > balanced_rate_(i+1) = balanced_rate_(i) * 2 > > > = (write_bw / N) * 2 > > > > > > That means, any position imbalance will lead to balanced_rate > > > estimation errors if we follow (4). Whereas if (1)/(5) is used, we > > > always get the right balanced dirty ratelimit value whether or not > > > (pos_ratio == 1.0), hence make the rate estimation independent(*) of > > > dirty position control. > > > > > > (*) independent as in real values, not the seemingly relations in equation > > > > > > The assumption here is that N is a constant.. in the above case > > pos_ratio would eventually end up at 1 and things would be good again. I > > see your argument about oscillations, but I think you can introduce > > similar effects by varying N. > > Yeah, it's very possible for N to change over time, in which case > balanced_rate will adapt to new N in similar way. Gah.. but but but, that gives the same stuff as your (6)+(4). Why won't you accept that for pos_ratio but you don't mind for N ? ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-26 9:04 ` Peter Zijlstra @ 2011-08-26 10:04 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-26 10:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, Aug 26, 2011 at 05:04:29PM +0800, Peter Zijlstra wrote: > On Fri, 2011-08-26 at 08:18 +0800, Wu Fengguang wrote: > > On Thu, Aug 25, 2011 at 12:12:58AM +0800, Peter Zijlstra wrote: > > > On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote: > > > > > Put (6) into (4), we get > > > > > > > > balanced_rate_(i+1) = balanced_rate_(i) * 2 > > > > = (write_bw / N) * 2 > > > > > > > > That means, any position imbalance will lead to balanced_rate > > > > estimation errors if we follow (4). Whereas if (1)/(5) is used, we > > > > always get the right balanced dirty ratelimit value whether or not > > > > (pos_ratio == 1.0), hence make the rate estimation independent(*) of > > > > dirty position control. > > > > > > > > (*) independent as in real values, not the seemingly relations in equation > > > > > > > > > The assumption here is that N is a constant.. in the above case > > > pos_ratio would eventually end up at 1 and things would be good again. I > > > see your argument about oscillations, but I think you can introduce > > > similar effects by varying N. > > > > Yeah, it's very possible for N to change over time, in which case > > balanced_rate will adapt to new N in similar way. > > Gah.. but but but, that gives the same stuff as your (6)+(4). Why won't > you accept that for pos_ratio but you don't mind for N ? Sorry I'm now feeling lost...anyway it's convenient to try out the pure rate feedback. And the test case exactly includes the sudden change of N. 
I'm now running the tests with this trivial patch: --- linux-next.orig/mm/page-writeback.c 2011-08-26 17:58:01.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-08-26 17:59:06.000000000 +0800 @@ -800,7 +800,7 @@ static void bdi_update_dirty_ratelimit(s * the dirty count meet the setpoint, but also where the slope of * pos_ratio is most flat and hence task_ratelimit is least fluctuated. */ - balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw, + balanced_dirty_ratelimit = div_u64((u64)dirty_ratelimit * write_bw, dirty_rate | 1); /* ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-26 10:04 ` Wu Fengguang @ 2011-08-26 10:42 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-26 10:42 UTC (permalink / raw) To: Wu Fengguang Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, 2011-08-26 at 18:04 +0800, Wu Fengguang wrote: > Sorry I'm now feeling lost... hehe welcome to my world ;-) Seriously though, I appreciate all the effort you put in trying to explain things. I feel I do understand things now, although I might not completely agree with them quite yet ;-) I'll go read the v9 patch-set you send out and look at some of the details (such as pos_ratio being comprised of both global and bdi limits, which so far has been somewhat glossed over). ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-26 10:42 ` Peter Zijlstra @ 2011-08-26 10:52 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-26 10:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, Aug 26, 2011 at 06:42:22PM +0800, Peter Zijlstra wrote: > On Fri, 2011-08-26 at 18:04 +0800, Wu Fengguang wrote: > > Sorry I'm now feeling lost... > > hehe welcome to my world ;-) Yeah, so sorry... > Seriously though, I appreciate all the effort you put in trying to > explain things. I feel I do understand things now, although I might not > completely agree with them quite yet ;-) Thank you :) > I'll go read the v9 patch-set you send out and look at some of the > details (such as pos_ratio being comprised of both global and bdi > limits, which so far has been somewhat glossed over). Hold on please! I'll immediately post a v10 with all the comment updates. Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-26 10:04 ` Wu Fengguang (?) (?) @ 2011-08-26 11:26 ` Wu Fengguang 2011-08-26 12:11 ` Peter Zijlstra -1 siblings, 1 reply; 301+ messages in thread From: Wu Fengguang @ 2011-08-26 11:26 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML [-- Attachment #1: Type: text/plain, Size: 1633 bytes --] Peter, Now I get 3 figures. Test case is: run 1 dd write task for 300s, with a "disturber" dd read task during roughly 120-130s. (1) balance_dirty_pages-pages.png This is the output of the original patchset. Here the "balanced ratelimit" dots are mostly accurate except when near @freerun or @limit. (2) balance_dirty_pages-pages_pure-rate-feedback.png do this change: - balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw, + balanced_dirty_ratelimit = div_u64((u64)dirty_ratelimit * write_bw, dirty_rate | 1); Here the "balanced ratelimit" dots go in the opposite direction compared to "pos ratelimit", which is the expected result discussed in the other email. Then the system got stuck in an unbalanced dirty position. It's slowly moving towards the setpoint thanks to the dirty_ratelimit update policy: it only updates dirty_ratelimit when balanced_dirty_ratelimit fluctuates to the same side of task_ratelimit, hence introducing some systematic "errors" in the right direction ;) (3) balance_dirty_pages-pages_pure-rate-feedback-without-dirty_ratelimit-update-constraints.png further removes the "do conservative bdi->dirty_ratelimit updates" feature, by replacing its update policy with a direct assignment: bdi->dirty_ratelimit = max(balanced_dirty_ratelimit, 1UL); This is to check whether dirty_ratelimit can still go back to the balance point without the help of the dirty_ratelimit update policy.
To my surprise, dirty_ratelimit jumps to a HUGE singular value and shows no sign of coming back to normal. In summary, the original patchset shows the best behavior :) Thanks, Fengguang [-- Attachment #2: balance_dirty_pages-pages.png --] [-- Type: image/png, Size: 75688 bytes --] [-- Attachment #3: balance_dirty_pages-pages_pure-rate-feedback.png --] [-- Type: image/png, Size: 83327 bytes --] [-- Attachment #4: balance_dirty_pages-pages_pure-rate-feedback-without-dirty_ratelimit-update-constraints.png --] [-- Type: image/png, Size: 63923 bytes --] ^ permalink raw reply [flat|nested] 301+ messages in thread
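[Editorial note: the conservative dirty_ratelimit update policy described in case (2) above can be sketched roughly as follows. This is a hedged, simplified paraphrase in plain Python, not the kernel code; the function name and exact stepping rule are illustrative, the real logic lives in bdi_update_dirty_ratelimit() in mm/page-writeback.c.]

```python
def update_dirty_ratelimit(dirty_ratelimit, balanced_rate, task_ratelimit):
    """Only move dirty_ratelimit when the balanced rate estimate and the
    position-corrected task ratelimit both lie on the same side of it,
    and step no further than the nearer of the two targets."""
    if dirty_ratelimit < balanced_rate and dirty_ratelimit < task_ratelimit:
        return min(balanced_rate, task_ratelimit)   # step up, conservatively
    if dirty_ratelimit > balanced_rate and dirty_ratelimit > task_ratelimit:
        return max(balanced_rate, task_ratelimit)   # step down, conservatively
    return dirty_ratelimit  # the two signals disagree: hold, avoid oscillation
```

In the pure-rate-feedback case of figure (2), the balanced estimate tends to sit on the opposite side of task_ratelimit, so most update attempts hit the "hold" branch; the occasional same-side fluctuations are the systematic "errors in the right direction" that let the system drift slowly toward the setpoint.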
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-26 11:26 ` Wu Fengguang @ 2011-08-26 12:11 ` Peter Zijlstra 0 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-26 12:11 UTC (permalink / raw) To: Wu Fengguang Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, 2011-08-26 at 19:26 +0800, Wu Fengguang wrote: > Now I get 3 figures. Test case is: run 1 dd write task for 300s, with > a "disturber" dd read task during roughly 120-130s. Ah, but ideally the disturber task should run in bursts of 100ms (<feedback period), otherwise your N is indeed mostly constant. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-26 12:11 ` Peter Zijlstra @ 2011-08-26 12:20 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-26 12:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, Aug 26, 2011 at 08:11:50PM +0800, Peter Zijlstra wrote: > On Fri, 2011-08-26 at 19:26 +0800, Wu Fengguang wrote: > > Now I get 3 figures. Test case is: run 1 dd write task for 300s, with > > a "disturber" dd read task during roughly 120-130s. > > Ah, but ideally the disturber task should run in bursts of 100ms > (<feedback period), otherwise your N is indeed mostly constant. Ah yeah, the disturber task should be a dd writer! Then we get - 120s: N=1 => N=2 - 130s: N=2 => N=1 I'll try it right away. Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-26 12:20 ` Wu Fengguang (?) @ 2011-08-26 13:13 ` Wu Fengguang 2011-08-26 13:18 ` Peter Zijlstra -1 siblings, 1 reply; 301+ messages in thread From: Wu Fengguang @ 2011-08-26 13:13 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML [-- Attachment #1: Type: text/plain, Size: 845 bytes --] On Fri, Aug 26, 2011 at 08:20:57PM +0800, Wu Fengguang wrote: > On Fri, Aug 26, 2011 at 08:11:50PM +0800, Peter Zijlstra wrote: > > On Fri, 2011-08-26 at 19:26 +0800, Wu Fengguang wrote: > > > Now I get 3 figures. Test case is: run 1 dd write task for 300s, with > > > a "disturber" dd read task during roughly 120-130s. > > > > Ah, but ideally the disturber task should run in bursts of 100ms > > (<feedback period), otherwise your N is indeed mostly constant. > > Ah yeah, the disturber task should be a dd writer! Then we get > > - 120s: N=1 => N=2 > - 130s: N=2 => N=1 Here they are. The write disturber starts/stops around 150s. We got a similar result as in the read disturber case, even though one disturbs N and the other impacts writeout bandwidth. The original patchset is consistently performing much better :) Thanks, Fengguang [-- Attachment #2: balance_dirty_pages-pages.png --] [-- Type: image/png, Size: 120914 bytes --] [-- Attachment #3: balance_dirty_pages-pages_pure-rate-feedback.png --] [-- Type: image/png, Size: 142966 bytes --] ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-26 13:13 ` Wu Fengguang @ 2011-08-26 13:18 ` Peter Zijlstra 0 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-26 13:18 UTC (permalink / raw) To: Wu Fengguang Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, 2011-08-26 at 21:13 +0800, Wu Fengguang wrote: > We got a similar result as in the read disturber case, even though one > disturbs N and the other impacts writeout bandwidth. The original > patchset is consistently performing much better :) It does indeed, and I figure on these timescales it makes sense to assume N is a constant. Fair enough, thanks! ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-26 13:18 ` Peter Zijlstra @ 2011-08-26 13:24 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-26 13:24 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, Aug 26, 2011 at 09:18:21PM +0800, Peter Zijlstra wrote: > On Fri, 2011-08-26 at 21:13 +0800, Wu Fengguang wrote: > > We got a similar result as in the read disturber case, even though one > > disturbs N and the other impacts writeout bandwidth. The original > > patchset is consistently performing much better :) > > It does indeed, and I figure on these timescales it makes sense to > assume N is a constant. Fair enough, thanks! Thank you! Glad that we finally reached some consensus :) Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-24 0:12 ` Wu Fengguang @ 2011-08-24 18:00 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-24 18:00 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Wed, Aug 24, 2011 at 08:12:58AM +0800, Wu Fengguang wrote: > > You somehow directly jump to > > > > balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate > > > > without explaining why following will not work. > > > > balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate > > Thanks for asking that, it's probably the root of confusions, so let > me answer it standalone. > > It's actually pretty simple to explain this equation: > > write_bw > balanced_rate = task_ratelimit_200ms * ---------- (1) > dirty_rate > > If there are N dd tasks, each task is throttled at task_ratelimit_200ms > for the past 200ms, we are going to measure the overall bdi dirty rate > > dirty_rate = N * task_ratelimit_200ms (2) > > put (2) into (1) we get > > balanced_rate = write_bw / N (3) > > So equation (1) is the right estimation to get the desired target (3). > > > As for > > write_bw > balanced_rate_(i+1) = balanced_rate_(i) * ---------- (4) > dirty_rate > > Let's compare it with the "expanded" form of (1): > > write_bw > balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ---------- (5) > dirty_rate > > So the difference lies in pos_ratio. > > Believe it or not, it's exactly the seemingly use of pos_ratio that > makes (5) independent(*) of the position control. > > Why? Look at (4), assume the system is in a state > > - dirty rate is already balanced, ie. 
balanced_rate_(i) = write_bw / N > - dirty position is not balanced, for example pos_ratio = 0.5 > > balance_dirty_pages() will be rate limiting each tasks at half the > balanced dirty rate, yielding a measured > > dirty_rate = write_bw / 2 (6) > > Put (6) into (4), we get > > balanced_rate_(i+1) = balanced_rate_(i) * 2 > = (write_bw / N) * 2 > > That means, any position imbalance will lead to balanced_rate > estimation errors if we follow (4). Whereas if (1)/(5) is used, we > always get the right balanced dirty ratelimit value whether or not > (pos_ratio == 1.0), hence make the rate estimation independent(*) of > dirty position control. > > (*) independent as in real values, not the seemingly relations in equation Ok, I think I am beginning to see your point. Let me just elaborate on the example you gave. Assume a system is completely balanced and a task is writing at a 100MB/s rate. write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1 bdi->dirty_ratelimit = 100MB/s Now another task starts dirtying the page cache on the same bdi. The number of dirty pages should go up pretty fast and likely the position ratio feedback will kick in to reduce the dirtying rate. (Rate-based feedback does not kick in till the next 200ms, while pos_ratio feedback seems to be instantaneous.) Assume the new pos_ratio is .5, so the new throttle rate for both tasks is 50MB/s. bdi->dirty_ratelimit = 100MB/s (rate feedback has not kicked in yet) task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 * .5 = 50MB/s Now let's say 200ms have passed and the rate-based feedback is reevaluated. write_bw bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- dirty_bw bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s Ideally bdi->dirty_ratelimit should have now become 50MB/s as N=2, but that did not happen. The reason is that there are two feedback control loops and the pos_ratio loop reacts to imbalances much more quickly.
Because the position loop has already reacted to the imbalance and reduced the tasks' dirtying rate, the rate-based loop does not try to adjust anything and thinks everything is just fine. Things are fine in the sense that dirty_rate == write_bw still holds, but the system is not balanced in terms of number of dirty pages, and pos_ratio=.5. So you are trying to make one feedback loop aware of the second loop, so that if the second loop is unbalanced, the first loop reacts to that as well and does not just look at dirty_rate and write_bw. So refining the new balanced rate by pos_ratio helps. write_bw bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio dirty_bw Now if global dirty pages are imbalanced, the balanced rate will still go down despite the fact that dirty_bw == write_bw. This will lead to a further reduction in the task dirty rate, which in turn will lead to a reduced number of dirty pages and should eventually lead to pos_ratio=1. A related question, though, that I should have asked you long back: how does throttling based on rate help? Why could we not just work with two pos_ratios, one the global position ratio and the other the bdi position ratio, and then throttle tasks gradually to achieve smooth throttling behavior? IOW, what property does rate provide which is not available just by looking at per-bdi dirty pages? Can't we come up with a bdi setpoint and limit the way you have done for the global setpoint, and throttle tasks accordingly? Thanks Vivek ^ permalink raw reply [flat|nested] 301+ messages in thread
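[Editorial note: the numeric walk-through above condenses to a few lines of arithmetic. This is a hypothetical toy check in plain Python (illustrative names, not kernel code), using the same numbers: write_bw = 100 MB/s, a second task has just started (N = 2), and the position loop has already cut pos_ratio to 0.5.]

```python
write_bw = 100.0      # MB/s
ratelimit = 100.0     # bdi->dirty_ratelimit before the rate feedback runs
N = 2                 # a second dd task has started
pos_ratio = 0.5       # position feedback has already halved each task's rate

task_ratelimit = ratelimit * pos_ratio   # 50 MB/s per task
dirty_rate = N * task_ratelimit          # 100 MB/s, still equal to write_bw

# Rate feedback alone sees dirty_rate == write_bw and changes nothing:
without_pos = ratelimit * write_bw / dirty_rate

# Folding in pos_ratio moves the estimate to the true balance point:
with_pos = ratelimit * pos_ratio * write_bw / dirty_rate

print(without_pos)  # 100.0: stuck, the arrival of the second task never noticed
print(with_pos)     # 50.0 == write_bw / N
```

The pos_ratio factor is what lets the rate loop see the imbalance the position loop has already absorbed: without it the estimate stays at 100 MB/s, with it the estimate lands on the true per-task balance of 50 MB/s.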
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 18:00 ` Vivek Goyal
@ 2011-08-25  3:19   ` Wu Fengguang
  -1 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-25 3:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
      Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
      Andrea Righi, linux-mm, LKML

On Thu, Aug 25, 2011 at 02:00:58AM +0800, Vivek Goyal wrote:
> On Wed, Aug 24, 2011 at 08:12:58AM +0800, Wu Fengguang wrote:
> > > You somehow directly jump to
> > >
> > >     balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > >
> > > without explaining why following will not work.
> > >
> > >     balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate
> >
> > Thanks for asking that, it's probably the root of confusions, so let
> > me answer it standalone.
> >
> > It's actually pretty simple to explain this equation:
> >
> >                                              write_bw
> >     balanced_rate = task_ratelimit_200ms * ----------            (1)
> >                                             dirty_rate
> >
> > If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> > for the past 200ms, we are going to measure the overall bdi dirty rate
> >
> >     dirty_rate = N * task_ratelimit_200ms                        (2)
> >
> > put (2) into (1) we get
> >
> >     balanced_rate = write_bw / N                                 (3)
> >
> > So equation (1) is the right estimation to get the desired target (3).
> >
> > As for
> >
> >                                                 write_bw
> >     balanced_rate_(i+1) = balanced_rate_(i) * ----------         (4)
> >                                                dirty_rate
> >
> > Let's compare it with the "expanded" form of (1):
> >
> >                                                             write_bw
> >     balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------   (5)
> >                                                            dirty_rate
> >
> > So the difference lies in pos_ratio.
> >
> > Believe it or not, it's exactly the seemingly use of pos_ratio that
> > makes (5) independent(*) of the position control.
> >
> > Why? Look at (4), assume the system is in a state
> >
> > - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> > - dirty position is not balanced, for example pos_ratio = 0.5
> >
> > balance_dirty_pages() will be rate limiting each tasks at half the
> > balanced dirty rate, yielding a measured
> >
> >     dirty_rate = write_bw / 2                                    (6)
> >
> > Put (6) into (4), we get
> >
> >     balanced_rate_(i+1) = balanced_rate_(i) * 2
> >                         = (write_bw / N) * 2
> >
> > That means, any position imbalance will lead to balanced_rate
> > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > always get the right balanced dirty ratelimit value whether or not
> > (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> > dirty position control.
> >
> > (*) independent as in real values, not the seemingly relations in equation
>
> Ok, I think I am beginning to see your point. Let me just elaborate on
> the example you gave.

Thank you very much :)

> Assume a system is completely balanced and a task is writing at 100MB/s
> rate.
>
> write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1
>
> bdi->dirty_ratelimit = 100MB/s
>
> Now another tasks starts dirtying the page cache on same bdi. Number of
> dirty pages should go up pretty fast and likely position ratio feedback
> will kick in to reduce the dirtying rate. (rate based feedback does not
> kick in till next 200ms) and pos_ratio feedback seems to be instantaneous.

That's right. There must be some instantaneous feedback to react to
fast workload changes. With pos_ratio providing this capability, the
estimated balanced rate can take time to follow.

Note that pos_ratio by itself is enough to limit dirty pages within the
[freerun, limit] control scope. The cost of a (temporarily) large error
in the balanced rate is that task_ratelimit will fluctuate much more,
due to the fact that pos_ratio will depart from 1.0 (to the point where
it can fully compensate for the rate errors) and dirty pages will
approach @freerun or @limit, where the slope of pos_ratio goes sharp.
The correct estimation of the balanced rate serves to drive pos_ratio
back to 1.0, where it has the flattest slope.

> Assume new pos_ratio is .5
>
> So new throttle rate for both the tasks is 50MB/s.
>
> bdi->dirty_ratelimit = 100MB/s (a feedback has not kicked in yet)
> task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s
>
> Now lets say 200ms have passed and rate base feedback is reevaluated.
>
>                                                           write_bw
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
>                                                           dirty_bw
>
> bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s
>
> Ideally bdi->dirty_ratelimit should have now become 50MB/s as N=2 but
> that did not happen. And reason being that there are two feedback control
> loops and pos_ratio loops reacts to imbalances much more quickly. Because
> previous loop has already reacted to the imbalance and reduced the
> dirtying rate of task, rate based loop does not try to adjust anything
> and thinks everything is just fine.

That's right.

> Things are fine in the sense that still dirty_rate == write_bw but
> system is not balanced in terms of number of dirty pages and pos_ratio=.5

Yes. The bad thing is, if the above equation (of pure rate feedback) is
used, the system is going to remain in that position-imbalanced state
forever, which is bad for the smoothness of task_ratelimit.

> So you are trying to make one feedback loop aware of second loop so that
> if second loop is unbalanced, first loop reacts to that as well and not
> just look at dirty_rate and write_bw. So refining new balanced rate by
> pos_ratio helps.
>                                                       write_bw
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
>                                                       dirty_bw
>
> Now if global dirty pages are imbalanced, balanced rate will still go
> down despite the fact that dirty_bw == write_bw. This will lead to
> further reduction in task dirty rate. Which in turn will lead to reduced
> number of dirty rate and should eventually lead to pos_ratio=1.

Right, that's a good alternative viewpoint to the one below.

                                                    write_bw
    bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
                                                    dirty_bw

(1) the periodic rate estimation uses that to refresh the balanced rate
    on every 200ms
(2) as long as the rate estimation is correct, pos_ratio is able to
    drive itself to 1.0

> A related question though I should have asked you this long back. How does
> throttling based on rate helps. Why we could not just work with two
> pos_ratios. One is gloabl postion ratio and other is bdi position ratio.
> And then throttle task gradually to achieve smooth throttling behavior.
> IOW, what property does rate provide which is not available just by
> looking at per bdi dirty pages. Can't we come up with bdi setpoint and
> limit the way you have done for gloabl setpoint and throttle tasks
> accordingly?

Good question. If we have no idea of the balanced rate at all, but
still want to limit dirty pages within the range [freerun, limit],
all we can do is to throttle the task at e.g. 1TB/s at @freerun and
0 at @limit. Then you get a really sharp control line which will make
task_ratelimit fluctuate like mad...

So the balanced rate estimation is the key to getting a smooth
task_ratelimit, while pos_ratio is the ultimate guarantee for the
dirty pages range.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-25  3:19   ` Wu Fengguang
@ 2011-08-25 22:20     ` Vivek Goyal
  -1 siblings, 0 replies; 301+ messages in thread
From: Vivek Goyal @ 2011-08-25 22:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
      Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
      Andrea Righi, linux-mm, LKML

On Thu, Aug 25, 2011 at 11:19:34AM +0800, Wu Fengguang wrote:

[..]
> > So you are trying to make one feedback loop aware of second loop so that
> > if second loop is unbalanced, first loop reacts to that as well and not
> > just look at dirty_rate and write_bw. So refining new balanced rate by
> > pos_ratio helps.
> >                                                           write_bw
> > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> >                                                           dirty_bw
> >
> > Now if global dirty pages are imbalanced, balanced rate will still go
> > down despite the fact that dirty_bw == write_bw. This will lead to
> > further reduction in task dirty rate. Which in turn will lead to reduced
> > number of dirty rate and should eventually lead to pos_ratio=1.
>
> Right, that's a good alternative viewpoint to the below one.
>
>                                                     write_bw
>     bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
>                                                     dirty_bw
>
> (1) the periodic rate estimation uses that to refresh the balanced rate on every 200ms
> (2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0

Personally I found it much easier to understand the other
representation. Once you have come up with the equation

    balance_rate_(i+1) = balance_rate(i) * write_bw/dirty_bw

can you please put a few lines of comments explaining why the above
alone is not sufficient and why we also need to take pos_ratio into
account to keep the number of dirty pages in check? And then go on to

    balance_rate_(i+1) = balance_rate(i) * write_bw/dirty_bw * pos_ratio

This kind of maintains the continuity of the explanation and explains
why we are deviating from the theory we discussed so far.

>
> > A related question though I should have asked you this long back. How does
> > throttling based on rate helps. Why we could not just work with two
> > pos_ratios. One is gloabl postion ratio and other is bdi position ratio.
> > And then throttle task gradually to achieve smooth throttling behavior.
> > IOW, what property does rate provide which is not available just by
> > looking at per bdi dirty pages. Can't we come up with bdi setpoint and
> > limit the way you have done for gloabl setpoint and throttle tasks
> > accordingly?
>
> Good question. If we have no idea of the balanced rate at all, but
> still want to limit dirty pages within the range [freerun, limit],
> all we can do is to throttle the task at eg. 1TB/s at @freerun and
> 0 at @limit. Then you get a really sharp control line which will make
> task_ratelimit fluctuate like mad...
>
> So the balanced rate estimation is the key to get smooth task_ratelimit,
> while pos_ratio is the ultimate guarantee for the dirty pages range.

Ok, that makes sense. By keeping an estimation of the rate at which the
bdi can write, our range of throttling goes down, say 0 to 300MB/s
instead of 0 to 1TB/s, and that can lead to smoother behavior.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-25 22:20     ` Vivek Goyal
@ 2011-08-26  1:56       ` Wu Fengguang
  -1 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-26 1:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
      Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
      Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 06:20:01AM +0800, Vivek Goyal wrote:
> On Thu, Aug 25, 2011 at 11:19:34AM +0800, Wu Fengguang wrote:
>
> [..]
> > > So you are trying to make one feedback loop aware of second loop so that
> > > if second loop is unbalanced, first loop reacts to that as well and not
> > > just look at dirty_rate and write_bw. So refining new balanced rate by
> > > pos_ratio helps.
> > >                                                           write_bw
> > > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> > >                                                           dirty_bw
> > >
> > > Now if global dirty pages are imbalanced, balanced rate will still go
> > > down despite the fact that dirty_bw == write_bw. This will lead to
> > > further reduction in task dirty rate. Which in turn will lead to reduced
> > > number of dirty rate and should eventually lead to pos_ratio=1.
> >
> > Right, that's a good alternative viewpoint to the below one.
> >
> >                                                     write_bw
> >     bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
> >                                                     dirty_bw
> >
> > (1) the periodic rate estimation uses that to refresh the balanced rate on every 200ms
> > (2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0
>
> Personally I found it much easier to understand the other representation.
> Once you have come up with equation.
>
>     balance_rate_(i+1) = balance_rate(i) * write_bw/dirty_bw
>
> Can you please put few lines of comments to explain that why above
> alone is not sufficient and we need to take pos_ratio also in to
> account to keep number of dirty pages in check.
>
> And then go onto
>
>     balance_rate_(i+1) = balance_rate(i) * write_bw/dirty_bw * pos_ratio
>
> This kind of maintains the continuity of explanation and explains
> that why are we deviating from the theory we discussed so far.

Good point. Here is the commented code:

	/*
	 * task_ratelimit reflects each dd's dirty rate for the past 200ms.
	 */
	task_ratelimit = (u64)dirty_ratelimit * pos_ratio >>
						RATELIMIT_CALC_SHIFT;

	/*
	 * A linear estimation of the "balanced" throttle rate. The theory is,
	 * if there are N dd tasks, each throttled at task_ratelimit, the
	 * bdi's dirty_rate will be measured to be (N * task_ratelimit).
	 * So the below formula will yield the balanced rate limit
	 * (write_bw / N).
	 *
	 * Note that the expanded form is not a pure rate feedback:
	 *	rate_(i+1) = rate_(i) * (write_bw / dirty_rate)             (1)
	 * but also takes pos_ratio into account:
	 *	rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio (2)
	 *
	 * (1) is not realistic because pos_ratio also takes part in balancing
	 * the dirty rate. Consider the state
	 *	pos_ratio = 0.5                                             (3)
	 *	rate = 2 * (write_bw / N)                                   (4)
	 * If (1) is used, it will get stuck in that state! Because each dd
	 * will be throttled at
	 *	task_ratelimit = pos_ratio * rate = (write_bw / N)          (5)
	 * yielding
	 *	dirty_rate = N * task_ratelimit = write_bw                  (6)
	 * put (6) into (1) we get
	 *	rate_(i+1) = rate_(i)                                       (7)
	 *
	 * So we end up using (2) to always keep
	 *	rate_(i+1) ~= (write_bw / N)                                (8)
	 * regardless of the value of pos_ratio. As long as (8) is satisfied,
	 * pos_ratio is able to drive itself to 1.0, which is not only where
	 * the dirty count meets the setpoint, but also where the slope of
	 * pos_ratio is most flat and hence task_ratelimit fluctuates least.
	 */
	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
					   dirty_rate | 1);

>
> > > A related question though I should have asked you this long back. How does
> > > throttling based on rate helps. Why we could not just work with two
> > > pos_ratios. One is gloabl postion ratio and other is bdi position ratio.
> > > And then throttle task gradually to achieve smooth throttling behavior.
> > > IOW, what property does rate provide which is not available just by
> > > looking at per bdi dirty pages. Can't we come up with bdi setpoint and
> > > limit the way you have done for gloabl setpoint and throttle tasks
> > > accordingly?
> >
> > Good question. If we have no idea of the balanced rate at all, but
> > still want to limit dirty pages within the range [freerun, limit],
> > all we can do is to throttle the task at eg. 1TB/s at @freerun and
> > 0 at @limit. Then you get a really sharp control line which will make
> > task_ratelimit fluctuate like mad...
> >
> > So the balanced rate estimation is the key to get smooth task_ratelimit,
> > while pos_ratio is the ultimate guarantee for the dirty pages range.
>
> Ok, that makes sense. By keeping an estimation of rate at which bdi
> can write, our range of throttling goes down. Say 0 to 300MB/s instead
> of 0 to 1TB/sec and that can lead to a more smooth behavior.

Yeah exactly, and even better, we can make the slope much flatter around
the setpoint to achieve excellent smoothness in the stable state :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control @ 2011-08-26 1:56 ` Wu Fengguang 0 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-26 1:56 UTC (permalink / raw) To: Vivek Goyal Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, Aug 26, 2011 at 06:20:01AM +0800, Vivek Goyal wrote: > On Thu, Aug 25, 2011 at 11:19:34AM +0800, Wu Fengguang wrote: > > [..] > > > So you are trying to make one feedback loop aware of second loop so that > > > if second loop is unbalanced, first loop reacts to that as well and not > > > just look at dirty_rate and write_bw. So refining new balanced rate by > > > pos_ratio helps. > > > write_bw > > > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio > > > dirty_bw > > > > > > Now if global dirty pages are imbalanced, balanced rate will still go > > > down despite the fact that dirty_bw == write_bw. This will lead to > > > further reduction in task dirty rate. Which in turn will lead to reduced > > > number of dirty rate and should eventually lead to pos_ratio=1. > > > > Right, that's a good alternative viewpoint to the below one. > > > > write_bw > > bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * --------- > > dirty_bw > > > > (1) the periodic rate estimation uses that to refresh the balanced rate on every 200ms > > (2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0 > > Personally I found it much easier to understand the other representation. > Once you have come up with equation. > > balance_rate_(i+1) = balance_rate(i) * write_bw/dirty_bw > > Can you please put few lines of comments to explain that why above > alone is not sufficient and we need to take pos_ratio also in to > account to keep number of dirty pages in check. 
And then go onto > > balance_rate_(i+1) = balance_rate(i) * write_bw/dirty_bw * pos_ratio > > This kind of maintains the continuity of explanation and explains > that why are we deviating from the theory we discussed so far. Good point. Here is the commented code: /* * task_ratelimit reflects each dd's dirty rate for the past 200ms. */ task_ratelimit = (u64)dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT; /* * A linear estimation of the "balanced" throttle rate. The theory is, * if there are N dd tasks, each throttled at task_ratelimit, the bdi's * dirty_rate will be measured to be (N * task_ratelimit). So the below * formula will yield the balanced rate limit (write_bw / N). * * Note that the expanded form is not a pure rate feedback: * rate_(i+1) = rate_(i) * (write_bw / dirty_rate) (1) * but also takes pos_ratio into account: * rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio (2) * * (1) is not realistic because pos_ratio also takes part in balancing * the dirty rate. Consider the state * pos_ratio = 0.5 (3) * rate = 2 * (write_bw / N) (4) * If (1) is used, it will stuck in that state! Because each dd will be * throttled at * task_ratelimit = pos_ratio * rate = (write_bw / N) (5) * yielding * dirty_rate = N * task_ratelimit = write_bw (6) * put (6) into (1) we get * rate_(i+1) = rate_(i) (7) * * So we end up using (2) to always keep * rate_(i+1) ~= (write_bw / N) (8) * regardless of the value of pos_ratio. As long as (8) is satisfied, * pos_ratio is able to drive itself to 1.0, which is not only where * the dirty count meet the setpoint, but also where the slope of * pos_ratio is most flat and hence task_ratelimit is least fluctuated. */ balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw, dirty_rate | 1); > > > > > A related question though I should have asked you this long back. How does > > > throttling based on rate helps. Why we could not just work with two > > > pos_ratios. 
One is the global position ratio and the other is the bdi position ratio. > > > And then throttle tasks gradually to achieve smooth throttling behavior. > > > IOW, what property does rate provide which is not available just by > > > looking at per bdi dirty pages. Can't we come up with a bdi setpoint and > > > limit the way you have done for the global setpoint and throttle tasks > > > accordingly? > > > > Good question. If we have no idea of the balanced rate at all, but > > still want to limit dirty pages within the range [freerun, limit], > > all we can do is to throttle the task at e.g. 1TB/s at @freerun and > > 0 at @limit. Then you get a really sharp control line which will make > > task_ratelimit fluctuate like mad... > > > > So the balanced rate estimation is the key to a smooth task_ratelimit, > > while pos_ratio is the ultimate guarantee for the dirty pages range. > > Ok, that makes sense. By keeping an estimation of the rate at which the bdi > can write, our range of throttling goes down. Say 0 to 300MB/s instead > of 0 to 1TB/s, and that can lead to smoother behavior. Yeah exactly, and even better, we can make the slope much flatter around the setpoint to achieve excellent smoothness in the stable state :) Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-26 1:56 ` Wu Fengguang @ 2011-08-26 8:56 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-26 8:56 UTC (permalink / raw) To: Wu Fengguang Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, 2011-08-26 at 09:56 +0800, Wu Fengguang wrote: > /* > * A linear estimation of the "balanced" throttle rate. The theory is, > * if there are N dd tasks, each throttled at task_ratelimit, the bdi's > * dirty_rate will be measured to be (N * task_ratelimit). So the below > * formula will yield the balanced rate limit (write_bw / N). > * > * Note that the expanded form is not a pure rate feedback: > * rate_(i+1) = rate_(i) * (write_bw / dirty_rate) (1) > * but also takes pos_ratio into account: > * rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio (2) > * > * (1) is not realistic because pos_ratio also takes part in balancing > * the dirty rate. Consider the state > * pos_ratio = 0.5 (3) > * rate = 2 * (write_bw / N) (4) > * If (1) is used, it will stuck in that state! Because each dd will be > * throttled at > * task_ratelimit = pos_ratio * rate = (write_bw / N) (5) > * yielding > * dirty_rate = N * task_ratelimit = write_bw (6) > * put (6) into (1) we get > * rate_(i+1) = rate_(i) (7) > * > * So we end up using (2) to always keep > * rate_(i+1) ~= (write_bw / N) (8) > * regardless of the value of pos_ratio. As long as (8) is satisfied, > * pos_ratio is able to drive itself to 1.0, which is not only where > * the dirty count meet the setpoint, but also where the slope of > * pos_ratio is most flat and hence task_ratelimit is least fluctuated. > */ I'm still not buying this, it has the massive assumption N is a constant, without that assumption you get the same kind of thing you get from not adding pos_ratio to the feedback term. 
Also, I've yet to see what harm it does if you leave it out, all feedback loops should stabilize just fine. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-26 8:56 ` Peter Zijlstra @ 2011-08-26 9:53 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-26 9:53 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, Aug 26, 2011 at 04:56:11PM +0800, Peter Zijlstra wrote: > On Fri, 2011-08-26 at 09:56 +0800, Wu Fengguang wrote: > > /* > > * A linear estimation of the "balanced" throttle rate. The theory is, > > * if there are N dd tasks, each throttled at task_ratelimit, the bdi's > > * dirty_rate will be measured to be (N * task_ratelimit). So the below > > * formula will yield the balanced rate limit (write_bw / N). > > * > > * Note that the expanded form is not a pure rate feedback: > > * rate_(i+1) = rate_(i) * (write_bw / dirty_rate) (1) > > * but also takes pos_ratio into account: > > * rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio (2) > > * > > * (1) is not realistic because pos_ratio also takes part in balancing > > * the dirty rate. Consider the state > > * pos_ratio = 0.5 (3) > > * rate = 2 * (write_bw / N) (4) > > * If (1) is used, it will stuck in that state! Because each dd will be > > * throttled at > > * task_ratelimit = pos_ratio * rate = (write_bw / N) (5) > > * yielding > > * dirty_rate = N * task_ratelimit = write_bw (6) > > * put (6) into (1) we get > > * rate_(i+1) = rate_(i) (7) > > * > > * So we end up using (2) to always keep > > * rate_(i+1) ~= (write_bw / N) (8) > > * regardless of the value of pos_ratio. As long as (8) is satisfied, > > * pos_ratio is able to drive itself to 1.0, which is not only where > > * the dirty count meet the setpoint, but also where the slope of > > * pos_ratio is most flat and hence task_ratelimit is least fluctuated. 
> > */ > > I'm still not buying this, it has the massive assumption N is a > constant, without that assumption you get the same kind of thing you get > from not adding pos_ratio to the feedback term. The reasoning between (3)-(7) actually assumes both N and write_bw to be constant. It's documenting a stuck state.. > Also, I've yet to see what harm it does if you leave it out, all > feedback loops should stabilize just fine. That's a good question. It should be trivial to try out equation (1) and see how it works out in practice. Let me collect some figures.. Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-24 18:00 ` Vivek Goyal @ 2011-08-29 13:12 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-29 13:12 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Wed, 2011-08-24 at 14:00 -0400, Vivek Goyal wrote: > > Ok, I think I am beginning to see your point. Let me just elaborate on > the example you gave. > > Assume a system is completely balanced and a task is writing at 100MB/s > rate. > > write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1 > > bdi->dirty_ratelimit = 100MB/s > > Now another tasks starts dirtying the page cache on same bdi. Number of > dirty pages should go up pretty fast and likely position ratio feedback > will kick in to reduce the dirtying rate. (rate based feedback does not > kick in till next 200ms) and pos_ratio feedback seems to be instantaneous. > Assume new pos_ratio is .5 > > So new throttle rate for both the tasks is 50MB/s. > > bdi->dirty_ratelimit = 100MB/s (a feedback has not kicked in yet) > task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s > > Now lets say 200ms have passed and rate base feedback is reevaluated. > > write_bw > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- > dirty_bw > > bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s > > Ideally bdi->dirty_ratelimit should have now become 50MB/s as N=2 but > that did not happen. And reason being that there are two feedback control > loops and pos_ratio loops reacts to imbalances much more quickly. Because > previous loop has already reacted to the imbalance and reduced the > dirtying rate of task, rate based loop does not try to adjust anything > and thinks everything is just fine. 
> > Things are fine in the sense that still dirty_rate == write_bw but > system is not balanced in terms of number of dirty pages and pos_ratio=.5 > > So you are trying to make one feedback loop aware of second loop so that > if second loop is unbalanced, first loop reacts to that as well and not > just look at dirty_rate and write_bw. So refining new balanced rate by > pos_ratio helps. > write_bw > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio > dirty_bw > > Now if global dirty pages are imbalanced, balanced rate will still go > down despite the fact that dirty_bw == write_bw. This will lead to > further reduction in task dirty rate. Which in turn will lead to reduced > number of dirty rate and should eventually lead to pos_ratio=1. Ok so this argument makes sense, is there some formalism to describe such systems where such things are more evident? ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-29 13:12 ` Peter Zijlstra @ 2011-08-29 13:37 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-29 13:37 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Mon, Aug 29, 2011 at 09:12:07PM +0800, Peter Zijlstra wrote: > On Wed, 2011-08-24 at 14:00 -0400, Vivek Goyal wrote: > > > > Ok, I think I am beginning to see your point. Let me just elaborate on > > the example you gave. > > > > Assume a system is completely balanced and a task is writing at 100MB/s > > rate. > > > > write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1 > > > > bdi->dirty_ratelimit = 100MB/s > > > > Now another tasks starts dirtying the page cache on same bdi. Number of > > dirty pages should go up pretty fast and likely position ratio feedback > > will kick in to reduce the dirtying rate. (rate based feedback does not > > kick in till next 200ms) and pos_ratio feedback seems to be instantaneous. > > Assume new pos_ratio is .5 > > > > So new throttle rate for both the tasks is 50MB/s. > > > > bdi->dirty_ratelimit = 100MB/s (a feedback has not kicked in yet) > > task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s > > > > Now lets say 200ms have passed and rate base feedback is reevaluated. > > > > write_bw > > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- > > dirty_bw > > > > bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s > > > > Ideally bdi->dirty_ratelimit should have now become 50MB/s as N=2 but > > that did not happen. And reason being that there are two feedback control > > loops and pos_ratio loops reacts to imbalances much more quickly. 
Because > > previous loop has already reacted to the imbalance and reduced the > > dirtying rate of task, rate based loop does not try to adjust anything > > and thinks everything is just fine. > > > > Things are fine in the sense that still dirty_rate == write_bw but > > system is not balanced in terms of number of dirty pages and pos_ratio=.5 > > > > So you are trying to make one feedback loop aware of second loop so that > > if second loop is unbalanced, first loop reacts to that as well and not > > just look at dirty_rate and write_bw. So refining new balanced rate by > > pos_ratio helps. > > write_bw > > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio > > dirty_bw > > > > Now if global dirty pages are imbalanced, balanced rate will still go > > down despite the fact that dirty_bw == write_bw. This will lead to > > further reduction in task dirty rate. Which in turn will lead to reduced > > number of dirty rate and should eventually lead to pos_ratio=1. > > > Ok so this argument makes sense, is there some formalism to describe > such systems where such things are more evident?

I find the easiest and cleanest way to describe it is:

(1) the below formula

                                                           write_bw
     bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
                                                           dirty_bw

    is able to yield

        dirty_ratelimit_(i) ~= (write_bw / N)

    as long as

    - write_bw, dirty_bw and pos_ratio are not changing rapidly
    - dirty pages are not around @freerun or @limit

    Otherwise there will be larger estimation errors.

(2) based on (1), we get

        task_ratelimit ~= (write_bw / N) * pos_ratio

    So the pos_ratio feedback is able to drive the dirty count to the setpoint, where pos_ratio = 1.

That interpretation based on _real values_ can neatly decouple the two feedback loops :) It makes full use of the fact that "the dirty_ratelimit _value_ is independent of pos_ratio except for possible impacts on estimation errors". Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-29 13:37 ` Wu Fengguang @ 2011-09-02 12:16 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-09-02 12:16 UTC (permalink / raw) To: Wu Fengguang Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Mon, 2011-08-29 at 21:37 +0800, Wu Fengguang wrote: > > > > Ok so this argument makes sense, is there some formalism to describe > > such systems where such things are more evident? > > I find the most easy and clean way to describe it is, > > (1) the below formula > write_bw > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio > dirty_bw > is able to yield > > dirty_ratelimit_(i) ~= (write_bw / N) > > as long as > > - write_bw, dirty_bw and pos_ratio are not changing rapidly > - dirty pages are not around @freerun or @limit > > Otherwise there will be larger estimation errors. > > (2) based on (1), we get > > task_ratelimit ~= (write_bw / N) * pos_ratio > > So the pos_ratio feedback is able to drive dirty count to the > setpoint, where pos_ratio = 1. > > That interpretation based on _real values_ can neatly decouple the two > feedback loops :) It makes full utilization of the fact "the > dirty_ratelimit _value_ is independent on pos_ratio except for > possible impacts on estimation errors". OK, so the 'problem' I have with this is that the whole control thing really doesn't care about N. All it does is measure: - dirty rate - writeback rate observe: - dirty count; with the independent input of its setpoint control: - ratelimit so I was looking for a way to describe the interaction between the two feedback loops without involving the exact details of what they're controlling, but that might just end up being an oxymoron. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-29 13:37 ` Wu Fengguang @ 2011-09-06 12:40 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-09-06 12:40 UTC (permalink / raw) To: Wu Fengguang Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, 2011-09-02 at 14:16 +0200, Peter Zijlstra wrote: > On Mon, 2011-08-29 at 21:37 +0800, Wu Fengguang wrote: > > > > > > Ok so this argument makes sense, is there some formalism to describe > > > such systems where such things are more evident? > > > > I find the most easy and clean way to describe it is, > > > > (1) the below formula > > write_bw > > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio > > dirty_bw > > is able to yield > > > > dirty_ratelimit_(i) ~= (write_bw / N) > > > > as long as > > > > - write_bw, dirty_bw and pos_ratio are not changing rapidly > > - dirty pages are not around @freerun or @limit > > > > Otherwise there will be larger estimation errors. > > > > (2) based on (1), we get > > > > task_ratelimit ~= (write_bw / N) * pos_ratio > > > > So the pos_ratio feedback is able to drive dirty count to the > > setpoint, where pos_ratio = 1. > > > > That interpretation based on _real values_ can neatly decouple the two > > feedback loops :) It makes full utilization of the fact "the > > dirty_ratelimit _value_ is independent on pos_ratio except for > > possible impacts on estimation errors". > > OK, so the 'problem' I have with this is that the whole control thing > really doesn't care about N. 
All it does is measure: > > - dirty rate > > - writeback rate > > observe: > > - dirty count; with the independent input of its setpoint > > control: > > - ratelimit > > so I was looking for a way to describe the interaction between the two > > feedback loops without involving the exact details of what they're > > controlling, but that might just end up being an oxymoron. Hmm, so per Vivek's argument the system without pos_ratio in the feedback term isn't convergent. Therefore we should be able to argue on convergence/stability grounds that this term is indeed needed. Does the stability proof of a control system need a model of what it's controlling? I guess I ought to go get a book on this or so. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control @ 2011-09-06 12:40 ` Peter Zijlstra 0 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-09-06 12:40 UTC (permalink / raw) To: Wu Fengguang Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Fri, 2011-09-02 at 14:16 +0200, Peter Zijlstra wrote: > On Mon, 2011-08-29 at 21:37 +0800, Wu Fengguang wrote: > > > > > > Ok so this argument makes sense, is there some formalism to describe > > > such systems where such things are more evident? > > > > I find the most easy and clean way to describe it is, > > > > (1) the below formula > > write_bw > > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio > > dirty_bw > > is able to yield > > > > dirty_ratelimit_(i) ~= (write_bw / N) > > > > as long as > > > > - write_bw, dirty_bw and pos_ratio are not changing rapidly > > - dirty pages are not around @freerun or @limit > > > > Otherwise there will be larger estimation errors. > > > > (2) based on (1), we get > > > > task_ratelimit ~= (write_bw / N) * pos_ratio > > > > So the pos_ratio feedback is able to drive dirty count to the > > setpoint, where pos_ratio = 1. > > > > That interpretation based on _real values_ can neatly decouple the two > > feedback loops :) It makes full utilization of the fact "the > > dirty_ratelimit _value_ is independent on pos_ratio except for > > possible impacts on estimation errors". > > OK, so the 'problem' I have with this is that the whole control thing > really doesn't care about N. All it does is measure: > > - dirty rate > - writeback rate > > observe: > > - dirty count; with the independent input of its setpoint > > control: > > - ratelimit > > so I was looking for a way to describe the interaction between the two > feedback loops without involving the exact details of what they're > controlling, but that might just end up being an oxymoron. 
Hmm, so per Vivek's argument the system without pos_ratio in the feedback term isn't convergent. Therefore we should be able to argue on convergence/stability grounds that this term is indeed needed. Does the stability proof of a control system need a model of what it's controlling? I guess I ought to go get a book on this or so. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 301+ messages in thread
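[Editorial note] The stability question can at least be probed numerically. Below is a toy sketch — not kernel code; the linear pos_ratio model and every constant are invented for illustration — of the two coupled loops discussed in the thread: the position control task_ratelimit = balance_rate * pos_ratio, and the rate estimation rate_(i+1) = rate_i * (write_bw / dirty_bw) * pos_ratio. With both terms present the system settles at rate ~= write_bw / N with the dirty count at the setpoint (pos_ratio ~= 1):

```python
# Toy simulation of the two coupled feedback loops from the thread.
# NOT kernel code: the linear pos_ratio model and all constants below
# are invented purely for illustration.

write_bw = 100.0    # writeout bandwidth (pages per 200ms interval)
N = 4               # number of dd tasks
setpoint = 1000.0   # dirty page target
span = 800.0        # linear control range of the toy pos_ratio

def pos_ratio(dirty):
    # crude linear position control around the setpoint, clamped
    return min(4.0, max(0.25, 1.0 - (dirty - setpoint) / span))

rate = 10.0         # initial balance_rate guess (per task)
dirty = 1200.0      # start above the setpoint

for _ in range(400):
    p = pos_ratio(dirty)
    task_rate = rate * p                      # task_ratelimit = balance_rate * pos_ratio
    dirty_bw = N * task_rate                  # aggregate dirtying rate this interval
    dirty += dirty_bw - write_bw              # dirty count integrates the mismatch
    rate = rate * (write_bw / dirty_bw) * p   # rate_(i+1) = rate_i * bw_ratio * pos_ratio

# at equilibrium: rate ~= write_bw / N, dirty ~= setpoint, pos_ratio ~= 1
```

Notably, in this idealized model the rate update collapses to write_bw / N after a single step (the constant cancels, since dirty_bw = N * rate * p), while the dirty count converges geometrically to the setpoint through pos_ratio alone — which is exactly the degeneracy the thread is debating.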
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-23 14:15 ` Wu Fengguang (?) @ 2011-08-24 15:57 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-24 15:57 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Tue, 2011-08-23 at 22:15 +0800, Wu Fengguang wrote: > On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote: > > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote: > > > - not a factor at all for updating balanced_rate (whether or not we do (2)) > > > well, in this concept: the balanced_rate formula inherently does not > > > derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's > > > based on the ratelimit executed for the past 200ms: > > > > > > balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio > > > > Ok, this is where it all goes funny.. > > > > So if you want completely separated feedback loops I would expect > > If call it feedback loops, then it's a series of independent feedback > loops of depth 1. Because each balanced_rate is a fresh estimation > dependent solely on > > - writeout bandwidth > - N, the number of dd tasks > > in the past 200ms. > > As long as a CONSTANT ratelimit (whatever value it is) is executed in > the past 200ms, we can get the same balanced_rate. > > balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate > > The resulted balanced_rate is independent of how large the CONSTANT > ratelimit is, because if we start with a doubled CONSTANT ratelimit, > we'll see doubled dirty_rate and result in the same balanced_rate. > > In that manner, balance_rate_(i+1) is not really depending on the > value of balance_rate_(i): whatever balance_rate_(i) is, we are going > to get the same balance_rate_(i+1) At best this argument says it doesn't matter what we use, making balance_rate_i an equally valid choice. 
However I don't buy this, your argument is broken, your CONSTANT_ratelimit breaks feedback but then you rely on the iterative form of feedback to finish your argument. Consider: r_(i+1) = r_i * ratio_i you say, r_i := C for all i, then by definition ratio_i must be 1 and you've got nothing. The only way your conclusion can be right is by allowing the proper iteration, otherwise we'll never reach the equilibrium. Now it is true you can introduce random perturbations in r_i at any given point and still end up in equilibrium, such is the power of iterative feedback, but that doesn't say you can do away with r_i. > > something like: > > > > balance_rate_(i+1) = balance_rate_(i) * bw_ratio ; every 200ms > > > > The former is a complete feedback loop, expressing the new value in the > > old value (*) with bw_ratio as feedback parameter; if we throttled too > > much, the dirty_rate will have dropped and the bw_ratio will be <1 > > causing the balance_rate to drop increasing the dirty_rate, and vice > > versa. > > In principle, the bw_ratio works that way. However since > balance_rate_(i) is not the exact _executed_ ratelimit in > balance_dirty_pages(). This seems to be where your argument goes bad, the actually executed ratelimit is not important, the variance introduced by pos_ratio is purely for the benefit of the dirty page count. It doesn't matter for the balance_rate. Without pos_ratio, the dirty page count would stay stable (ignoring all these oscillations and other fun things), and therefore it is the balance_rate we should be using for the iterative feedback. > > (*) which is the form I expected and why I thought your primary feedback > > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio > > Because the executed ratelimit was rate_(i) * pos_ratio. 
No, because iterative feedback has the form: new = old $op $feedback-term > > Then when you use the balance_rate to actually throttle tasks you apply > > your secondary control steering the dirty page count, yielding: > > > > task_rate = balance_rate * pos_ratio > > Right. Note the above formula is not a derived one, Agreed, it's not a derived expression but the originator of the dirty page count control. > but an original > one that later leads to pos_ratio showing up in the calculation of > balanced_rate. That's where I disagree :-) > > > and task_ratelimit_200ms happen to can be estimated from > > > > > > task_ratelimit_200ms ~= balanced_rate_i * pos_ratio > > > > > We may alternatively record every task_ratelimit executed in the > > > past 200ms and average them all to get task_ratelimit_200ms. In this > > > way we take the "superfluous" pos_ratio out of sight :) > > > > Right, so I'm not at all sure that makes sense, its not immediately > > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it > > clear to me why your primary feedback loop uses task_ratelimit_200ms at > > all. > > task_ratelimit is used and hence defined to be (balance_rate * pos_ratio) > by balance_dirty_pages(). So this is an original formula: > > task_ratelimit = balance_rate * pos_ratio > > task_ratelimit_200ms is also used as an original data source in > > balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate But that's exactly where you conflate the positional feedback with the throughput feedback, the effective ratelimit includes the positional feedback so that the dirty page count can move around, but that is completely orthogonal to the throughput feedback since the throughput thing would leave the dirty count constant (ideal case again). That is, yes the iterative feedback still works because you've still got your primary feedback in place, but the addition of pos_ratio in the feedback loop is a pure perturbation and doesn't matter one whit. 
> Then we try to estimate task_ratelimit_200ms by assuming all tasks > have been executing the same CONSTANT ratelimit in > balance_dirty_pages(). Hence we get > > task_ratelimit_200ms ~= prev_balance_rate * pos_ratio But this just cannot be true (and, as argued above, is completely unnecessary). Consider the case where the dirty count is way below the setpoint but the base ratelimit is pretty accurate. In that case we would start out by creating very low task ratelimits such that the dirty count can increase. Once we match the setpoint we go back to the base ratelimit. The average over those 200ms would be <1, but since we're right at the setpoint when we do the base ratelimit feedback we pick exactly 1. Anyway, it's completely irrelevant.. :-) > > > There is fundamentally no dependency between balanced_rate_(i+1) and > > > balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation > > > only asks for _whatever_ CONSTANT task ratelimit to be executed for > > > 200ms, then it get the balanced rate from the dirty_rate feedback. > > > > How can there not be a relation between balance_rate_(i+1) and > > balance_rate_(i) ? > > In this manner: even though balance_rate_(i) is somehow used for > calculating balance_rate_(i+1), the latter will evaluate to the same > value given whatever balance_rate_(i). But only if you allow for the iterative feedback to work, you absolutely need that balance_rate_(i), without that it's completely broken. ^ permalink raw reply [flat|nested] 301+ messages in thread
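[Editorial note] The "stable but not at the setpoint" failure mode that Vivek's argument turns on can be seen in the same kind of toy model (all names and constants below are illustrative, not taken from the kernel). Here the base ratelimit is pinned to a wrong constant C and only pos_ratio is left to steer: the dirty count stops moving once N * C * pos_ratio == write_bw, so the position loop does find an equilibrium, but at pos_ratio != 1, far away from the setpoint:

```python
# Toy model: position control only, with a wrong fixed base ratelimit.
# Illustrative constants; not kernel code.

write_bw = 100.0    # device writeback bandwidth (pages per interval)
N = 4               # dirtier tasks
setpoint = 1000.0   # dirty page target
span = 800.0        # linear control range of the toy pos_ratio

def pos_ratio(dirty):
    # crude linear position control around the setpoint, clamped
    return min(4.0, max(0.25, 1.0 - (dirty - setpoint) / span))

C = 80.0            # fixed (wrong) base ratelimit; the balanced value is 25.0
dirty = setpoint    # start exactly at the setpoint

for _ in range(300):
    p = pos_ratio(dirty)
    dirty += N * C * p - write_bw   # dirty count integrates the rate mismatch

p = pos_ratio(dirty)
# equilibrium: N * C * p == write_bw holds, but p != 1 and dirty != setpoint
```

So with the iterative rate feedback severed, pos_ratio alone pins the dirty rate to write_bw, yet the dirty count parks wherever pos_ratio happens to cancel the error in C — consistent with the point that the balance_rate_(i) term is needed to pull the equilibrium back to pos_ratio = 1.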
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-24 15:57 ` Peter Zijlstra @ 2011-08-25 5:30 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-25 5:30 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Wed, Aug 24, 2011 at 11:57:39PM +0800, Peter Zijlstra wrote: > On Tue, 2011-08-23 at 22:15 +0800, Wu Fengguang wrote: > > On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote: > > > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote: > > > > - not a factor at all for updating balanced_rate (whether or not we do (2)) > > > > well, in this concept: the balanced_rate formula inherently does not > > > > derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's > > > > based on the ratelimit executed for the past 200ms: > > > > > > > > balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio > > > > > > Ok, this is where it all goes funny.. > > > > > > So if you want completely separated feedback loops I would expect > > > > If call it feedback loops, then it's a series of independent feedback > > loops of depth 1. Because each balanced_rate is a fresh estimation > > dependent solely on > > > > - writeout bandwidth > > - N, the number of dd tasks > > > > in the past 200ms. > > > > As long as a CONSTANT ratelimit (whatever value it is) is executed in > > the past 200ms, we can get the same balanced_rate. > > > > balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate > > > > The resulted balanced_rate is independent of how large the CONSTANT > > ratelimit is, because if we start with a doubled CONSTANT ratelimit, > > we'll see doubled dirty_rate and result in the same balanced_rate. 
> > > > In that manner, balance_rate_(i+1) is not really depending on the > > value of balance_rate_(i): whatever balance_rate_(i) is, we are going > > to get the same balance_rate_(i+1) > > At best this argument says it doesn't matter what we use, making > balance_rate_i an equally valid choice. However I don't buy this, your > argument is broken, your CONSTANT_ratelimit breaks feedback but then you > rely on the iterative form of feedback to finish your argument. > > Consider: > > r_(i+1) = r_i * ratio_i > > you say, r_i := C for all i, then by definition ratio_i must be 1 and > you've got nothing. The only way your conclusion can be right is by > allowing the proper iteration, otherwise we'll never reach the > equilibrium. > > Now it is true you can introduce random perturbations in r_i at any > given point and still end up in equilibrium, such is the power of > iterative feedback, but that doesn't say you can do away with r_i. Sure, there is always an r_i. Sorry, what I mean by CONSTANT_ratelimit is that it remains CONSTANT _inside_ each 200ms window. There will be a series of different CONSTANT values, one per 200ms window, roughly (r_i * pos_ratio_i). > > > something like: > > > > > > balance_rate_(i+1) = balance_rate_(i) * bw_ratio ; every 200ms > > > > > > The former is a complete feedback loop, expressing the new value in the > > > old value (*) with bw_ratio as feedback parameter; if we throttled too > > > much, the dirty_rate will have dropped and the bw_ratio will be <1 > > > causing the balance_rate to drop increasing the dirty_rate, and vice > > > versa. > > > > In principle, the bw_ratio works that way. However since > > balance_rate_(i) is not the exact _executed_ ratelimit in > > balance_dirty_pages(). > > This seems to be where your argument goes bad, the actually executed > ratelimit is not important, the variance introduced by pos_ratio is > purely for the benefit of the dirty page count. > > It doesn't matter for the balance_rate. 
Without pos_ratio, the dirty > page count would stay stable (ignoring all these oscillations and other > fun things), and therefore it is the balance_rate we should be using for > the iterative feedback. Nope. The dirty page count can always stay stable somewhere (but not necessarily at setpoint) purely by the pos_ratio feedback, as illustrated by Vivek's example. But that's not the balance state we want. Although the pos_ratio feedback all by itself is strong enough to keep (dirty_rate == write_bw), the ideal state is to achieve pos_ratio=1 and eliminate its feedback error as much as possible, so as to get a smooth task_ratelimit. We may take this viewpoint: a "successful" balance_rate should help keep pos_ratio around 1.0 in the long term. > > > (*) which is the form I expected and why I thought your primary feedback > > > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio > > > > Because the executed ratelimit was rate_(i) * pos_ratio. > > No, because iterative feedback has the form: > > new = old $op $feedback-term > The problem is, the pos_ratio feedback will jump in and prematurely make $feedback-term = 1, thus rendering the pure rate feedback weak/useless. > > > Then when you use the balance_rate to actually throttle tasks you apply > > > your secondary control steering the dirty page count, yielding: > > > > > > task_rate = balance_rate * pos_ratio > > > > Right. Note the above formula is not a derived one, > > Agreed, its not a derived expression but the originator of the dirty > page count control. > > > but an original > > one that later leads to pos_ratio showing up in the calculation of > > balanced_rate. > > That's where I disagree :-) > > > > > and task_ratelimit_200ms happen to can be estimated from > > > > > > > > task_ratelimit_200ms ~= balanced_rate_i * pos_ratio > > > > > > > We may alternatively record every task_ratelimit executed in the > > > > past 200ms and average them all to get task_ratelimit_200ms.
In this > > > > way we take the "superfluous" pos_ratio out of sight :) > > > > > > Right, so I'm not at all sure that makes sense, its not immediately > > > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it > > > clear to me why your primary feedback loop uses task_ratelimit_200ms at > > > all. > > > > task_ratelimit is used and hence defined to be (balance_rate * pos_ratio) > > by balance_dirty_pages(). So this is an original formula: > > > > task_ratelimit = balance_rate * pos_ratio > > > > task_ratelimit_200ms is also used as an original data source in > > > > balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate > > But that's exactly where you conflate the positional feedback with the > throughput feedback, the effective ratelimit includes the positional > feedback so that the dirty page count can move around, but that is > completely orthogonal to the throughput feedback since the throughout > thing would leave the dirty count constant (ideal case again). > > That is, yes the iterative feedback still works because you still got > your primary feedback in place, but the addition of pos_ratio in the > feedback loop is a pure perturbation and doesn't matter one whit. The problem is that pure rate feedback is not possible because pos_ratio also takes part in altering the task rate... > > Then we try to estimate task_ratelimit_200ms by assuming all tasks > > have been executing the same CONSTANT ratelimit in > > balance_dirty_pages(). Hence we get > > > > task_ratelimit_200ms ~= prev_balance_rate * pos_ratio > > But this just cannot be true (and, as argued above, is completely > unnecessary). > > Consider the case where the dirty count is way below the setpoint but > the base ratelimit is pretty accurate. In that case we would start out > by creating very low task ratelimits such that the dirty count can s/low/high/ > increase. Once we match the setpoint we go back to the base ratelimit. 
> The average over those 200ms would be <1, but since we're right at the > setpoint when we do the base ratelimit feedback we pick exactly 1. Yeah, that's the kind of error introduced by the CONSTANT ratelimit, which could be pretty large on small-memory boxes. Given that pos_ratio will fluctuate more anyway when memory and hence the dirty control scope is small, such rate estimation errors are tolerable. > Anyway, its completely irrelevant.. :-) Yeah, that's one step further to discuss all kinds of possible errors on top of the basic theory :) > > > > There is fundamentally no dependency between balanced_rate_(i+1) and > > > > balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation > > > > only asks for _whatever_ CONSTANT task ratelimit to be executed for > > > > 200ms, then it get the balanced rate from the dirty_rate feedback. > > > > > > How can there not be a relation between balance_rate_(i+1) and > > > balance_rate_(i) ? > > > > In this manner: even though balance_rate_(i) is somehow used for > > calculating balance_rate_(i+1), the latter will evaluate to the same > > value given whatever balance_rate_(i). > > But only if you allow for the iterative feedback to work, you absolutely > need that balance_rate_(i), without that its completely broken. Agreed. Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
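[Editorial note] Wu's invariance claim — that the balanced_rate estimate does not depend on which CONSTANT ratelimit was actually executed during the 200ms window — reduces to simple algebra, sketched below with toy numbers (N identical dd tasks, idealized measurement): the observed dirty_rate scales linearly with the executed constant, so the constant cancels.

```python
# Idealized check that balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate
# is invariant to the executed constant. Toy numbers, not kernel code.

write_bw = 100.0                     # measured writeout bandwidth
N = 4                                # number of dd tasks

estimates = []
for C in (5.0, 25.0, 80.0):          # three very different executed ratelimits
    dirty_rate = N * C               # ideal case: dirtying rate observed over 200ms
    estimates.append(C * write_bw / dirty_rate)

# every estimate equals write_bw / N, whatever constant was executed
```

The cancellation of course assumes the executed limit really was constant and that dirty_rate responded linearly to it over the whole window — exactly the idealization Peter challenges once pos_ratio varies within the window.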
* Re: [PATCH 2/5] writeback: dirty position control @ 2011-08-25 5:30 ` Wu Fengguang 0 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-25 5:30 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Wed, Aug 24, 2011 at 11:57:39PM +0800, Peter Zijlstra wrote: > On Tue, 2011-08-23 at 22:15 +0800, Wu Fengguang wrote: > > On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote: > > > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote: > > > > - not a factor at all for updating balanced_rate (whether or not we do (2)) > > > > well, in this concept: the balanced_rate formula inherently does not > > > > derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's > > > > based on the ratelimit executed for the past 200ms: > > > > > > > > balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio > > > > > > Ok, this is where it all goes funny.. > > > > > > So if you want completely separated feedback loops I would expect > > > > If call it feedback loops, then it's a series of independent feedback > > loops of depth 1. Because each balanced_rate is a fresh estimation > > dependent solely on > > > > - writeout bandwidth > > - N, the number of dd tasks > > > > in the past 200ms. > > > > As long as a CONSTANT ratelimit (whatever value it is) is executed in > > the past 200ms, we can get the same balanced_rate. > > > > balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate > > > > The resulted balanced_rate is independent of how large the CONSTANT > > ratelimit is, because if we start with a doubled CONSTANT ratelimit, > > we'll see doubled dirty_rate and result in the same balanced_rate. 
> > > > In that manner, balance_rate_(i+1) is not really depending on the > > value of balance_rate_(i): whatever balance_rate_(i) is, we are going > > to get the same balance_rate_(i+1) > > At best this argument says it doesn't matter what we use, making > balance_rate_i an equally valid choice. However I don't buy this, your > argument is broken, your CONSTANT_ratelimit breaks feedback but then you > rely on the iterative form of feedback to finish your argument. > > Consider: > > r_(i+1) = r_i * ratio_i > > you say, r_i := C for all i, then by definition ratio_i must be 1 and > you've got nothing. The only way your conclusion can be right is by > allowing the proper iteration, otherwise we'll never reach the > equilibrium. > > Now it is true you can introduce random perturbations in r_i at any > given point and still end up in equilibrium, such is the power of > iterative feedback, but that doesn't say you can do away with r_i. Sure there are always r_i. Sorry what I mean CONSTANT_ratelimit is, it remains CONSTANT _inside_ every 200ms. There will be a series of different CONSTANT values for each 200ms, which is roughly (r_i * pos_ratio_i). > > > something like: > > > > > > balance_rate_(i+1) = balance_rate_(i) * bw_ratio ; every 200ms > > > > > > The former is a complete feedback loop, expressing the new value in the > > > old value (*) with bw_ratio as feedback parameter; if we throttled too > > > much, the dirty_rate will have dropped and the bw_ratio will be <1 > > > causing the balance_rate to drop increasing the dirty_rate, and vice > > > versa. > > > > In principle, the bw_ratio works that way. However since > > balance_rate_(i) is not the exact _executed_ ratelimit in > > balance_dirty_pages(). > > This seems to be where your argument goes bad, the actually executed > ratelimit is not important, the variance introduced by pos_ratio is > purely for the benefit of the dirty page count. > > It doesn't matter for the balance_rate. 
Without pos_ratio, the dirty > page count would stay stable (ignoring all these oscillations and other > fun things), and therefore it is the balance_rate we should be using for > the iterative feedback. Nope. The dirty page count can always stay stable somewhere (but not necessarily at setpoint) purely by the pos_ratio feedback, as illustrated by Vivek's example. But that's not the balance state we want. Although the pos_ratio feedback all by itself is strong enough to keep (dirty_rate == write_bw), the ideal state is to achieve pos_ratio=1 and eliminate its feedback error as much as possible, so as to get smooth task_ratelimit. We may take this viewpoint: a "successful" balance_rate should help keep pos_ratio around 1.0 in long term. > > > (*) which is the form I expected and why I thought your primary feedback > > > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio > > > > Because the executed ratelimit was rate_(i) * pos_ratio. > > No, because iterative feedback has the form: > > new = old $op $feedback-term > The problem is, the pos_ratio feedback will jump in and prematurely make $feedback-term = 1, thus rendering the pure rate feedback weak/useless. > > > Then when you use the balance_rate to actually throttle tasks you apply > > > your secondary control steering the dirty page count, yielding: > > > > > > task_rate = balance_rate * pos_ratio > > > > Right. Note the above formula is not a derived one, > > Agreed, its not a derived expression but the originator of the dirty > page count control. > > > but an original > > one that later leads to pos_ratio showing up in the calculation of > > balanced_rate. > > That's where I disagree :-) > > > > > and task_ratelimit_200ms happen to can be estimated from > > > > > > > > task_ratelimit_200ms ~= balanced_rate_i * pos_ratio > > > > > > > We may alternatively record every task_ratelimit executed in the > > > > past 200ms and average them all to get task_ratelimit_200ms. 
In this > > > > way we take the "superfluous" pos_ratio out of sight :) > > > > > > Right, so I'm not at all sure that makes sense, its not immediately > > > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it > > > clear to me why your primary feedback loop uses task_ratelimit_200ms at > > > all. > > > > task_ratelimit is used and hence defined to be (balance_rate * pos_ratio) > > by balance_dirty_pages(). So this is an original formula: > > > > task_ratelimit = balance_rate * pos_ratio > > > > task_ratelimit_200ms is also used as an original data source in > > > > balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate > > But that's exactly where you conflate the positional feedback with the > throughput feedback, the effective ratelimit includes the positional > feedback so that the dirty page count can move around, but that is > completely orthogonal to the throughput feedback since the throughout > thing would leave the dirty count constant (ideal case again). > > That is, yes the iterative feedback still works because you still got > your primary feedback in place, but the addition of pos_ratio in the > feedback loop is a pure perturbation and doesn't matter one whit. The problem is that pure rate feedback is not possible because pos_ratio also takes part in altering the task rate... > > Then we try to estimate task_ratelimit_200ms by assuming all tasks > > have been executing the same CONSTANT ratelimit in > > balance_dirty_pages(). Hence we get > > > > task_ratelimit_200ms ~= prev_balance_rate * pos_ratio > > But this just cannot be true (and, as argued above, is completely > unnecessary). > > Consider the case where the dirty count is way below the setpoint but > the base ratelimit is pretty accurate. In that case we would start out > by creating very low task ratelimits such that the dirty count can s/low/high/ > increase. Once we match the setpoint we go back to the base ratelimit. 
> The average over those 200ms would be <1, but since we're right at the > setpoint when we do the base ratelimit feedback we pick exactly 1. Yeah that's the kind of error introduced by the CONSTANT ratelimit. Which could be pretty large in small memory boxes. Given that pos_ratio will fluctuate more anyway when memory and hence the dirty control scope is small, such rate estimation errors are tolerable. > Anyway, its completely irrelevant.. :-) Yeah, that's one step further to discuss all kinds of possible errors on top of the basic theory :) > > > > There is fundamentally no dependency between balanced_rate_(i+1) and > > > > balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation > > > > only asks for _whatever_ CONSTANT task ratelimit to be executed for > > > > 200ms, then it get the balanced rate from the dirty_rate feedback. > > > > > > How can there not be a relation between balance_rate_(i+1) and > > > balance_rate_(i) ? > > > > In this manner: even though balance_rate_(i) is somehow used for > > calculating balance_rate_(i+1), the latter will evaluate to the same > > value given whatever balance_rate_(i). > > But only if you allow for the iterative feedback to work, you absolutely > need that balance_rate_(i), without that its completely broken. Agreed. Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-23 10:01 ` Peter Zijlstra @ 2011-08-23 14:36 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-23 14:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Tue, Aug 23, 2011 at 12:01:00PM +0200, Peter Zijlstra wrote: > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote: > > - not a factor at all for updating balanced_rate (whether or not we do (2)) > > well, in this concept: the balanced_rate formula inherently does not > > derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's > > based on the ratelimit executed for the past 200ms: > > > > balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio > > Ok, this is where it all goes funny.. Exactly. This is where it gets confusing and is the bone of contention. > > So if you want completely separated feedback loops I would expect > something like: > > balance_rate_(i+1) = balance_rate_(i) * bw_ratio ; every 200ms > I agree. This makes sense. IOW,

    bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_(n-1) * write_bw / dirty_rate

> The former is a complete feedback loop, expressing the new value in the > old value (*) with bw_ratio as feedback parameter; if we throttled too > much, the dirty_rate will have dropped and the bw_ratio will be <1 > causing the balance_rate to drop increasing the dirty_rate, and vice > versa. I think you meant: "if we throttled too much, the dirty_rate will have dropped and the bw_ratio will be >1 causing the balance_rate to increase hence increasing the dirty_rate, and vice versa." > > (*) which is the form I expected and why I thought your primary feedback > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio > > With the above balance_rate is an independent variable that tracks the > write bandwidth.
Now possibly you'd want a low-pass filter on that since > your bw_ratio is a bit funny in the head, but that's another story. > > Then when you use the balance_rate to actually throttle tasks you apply > your secondary control steering the dirty page count, yielding: > > task_rate = balance_rate * pos_ratio > > > and task_ratelimit_200ms happen to can be estimated from > > > > task_ratelimit_200ms ~= balanced_rate_i * pos_ratio > > > We may alternatively record every task_ratelimit executed in the > > past 200ms and average them all to get task_ratelimit_200ms. In this > > way we take the "superfluous" pos_ratio out of sight :) > > Right, so I'm not at all sure that makes sense, its not immediately > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it > clear to me why your primary feedback loop uses task_ratelimit_200ms at > all. > Well, I thought that this is evident:

    task_ratelimit = balanced_rate * pos_ratio

What is not evident to me is the following:

    balanced_rate_(i+1) = task_ratelimit_200ms * pos_ratio

Instead, like you, I also thought that the following is more obvious:

    balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio

Thanks Vivek ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-06 8:44 ` Wu Fengguang @ 2011-08-09 2:08 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-09 2:08 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Sat, Aug 06, 2011 at 04:44:49PM +0800, Wu Fengguang wrote:

> Old scheme is,
>                                          |
>            free run area                 |      throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^          dirty pages
>
> New scheme is,
>
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                    *
>   |                      *
>  ..bdi->dirty_ratelimit..........*
>   |                            .     *
>   |                            .          *
>   |                            .              *
>   |                            .                 *
>   |                            .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
>
> For simplicity, only the global/bdi setpoint control lines are > implemented here, so the [*] curve is more straight than the ideal one > showed in the above figure. > > bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so > that the resulted task rate limit can drive the dirty pages back to the > global/bdi setpoints.

IMHO, "position_ratio" is not necessarily very intuitive. Can there be a better name? Based on your slides, it is a scaling factor applied to the task rate limit depending on how well we are doing in terms of meeting our goal of dirty limit. Will "dirty_rate_scale_factor" or something like that make sense and be a little more intuitive?
Thanks Vivek > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > mm/page-writeback.c | 143 ++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 143 insertions(+) > > --- linux-next.orig/mm/page-writeback.c 2011-08-06 10:31:32.000000000 +0800 > +++ linux-next/mm/page-writeback.c 2011-08-06 11:17:07.000000000 +0800 > @@ -46,6 +46,8 @@ > */ > #define BANDWIDTH_INTERVAL max(HZ/5, 1) > > +#define BANDWIDTH_CALC_SHIFT 10 > + > /* > * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited > * will look to see if it needs to force writeback or throttling. > @@ -495,6 +497,147 @@ unsigned long bdi_dirty_limit(struct bac > return bdi_dirty; > } > > +/* > + * Dirty position control. > + * > + * (o) global/bdi setpoints > + * > + * When the number of dirty pages go higher/lower than the setpoint, the dirty > + * position ratio (and hence dirty rate limit) will be decreased/increased to > + * bring the dirty pages back to the setpoint. > + * > + * setpoint > + * v > + * |-------------------------------*-------------------------------|-----------| > + * ^ ^ ^ ^ > + * (thresh + background_thresh)/2 thresh - thresh/DIRTY_SCOPE thresh limit > + * > + * bdi setpoint > + * v > + * |-------------------------------*-------------------------------------------| > + * ^ ^ ^ > + * 0 bdi_thresh - bdi_thresh/DIRTY_SCOPE limit > + * > + * (o) pseudo code > + * > + * pos_ratio = 1 << BANDWIDTH_CALC_SHIFT > + * > + * if (dirty < thresh) scale up pos_ratio > + * if (dirty > thresh) scale down pos_ratio > + * > + * if (bdi_dirty < bdi_thresh) scale up pos_ratio > + * if (bdi_dirty > bdi_thresh) scale down pos_ratio > + * > + * (o) global/bdi control lines > + * > + * Based on the number of dirty pages (the X), pos_ratio (the Y) is scaled by > + * several control lines in turn. > + * > + * The control lines for the global/bdi setpoints both stretch up to @limit. 
> + * If any control line drops below Y=0 before reaching @limit, an auxiliary > + * line will be setup to connect them. The below figure illustrates the main > + * bdi control line with an auxiliary line extending it to @limit. > + * > + * This allows smoothly throttling bdi_dirty down to normal if it starts high > + * in situations like > + * - start writing to a slow SD card and a fast disk at the same time. The SD > + * card's bdi_dirty may rush to 5 times higher than bdi setpoint. > + * - the bdi dirty thresh goes down quickly due to change of JBOD workload > + * > + * o > + * o > + * o [o] main control line > + * o [*] auxiliary control line > + * o > + * o > + * o > + * o > + * o > + * o > + * o--------------------- balance point, bw scale = 1 > + * | o > + * | o > + * | o > + * | o > + * | o > + * | o > + * | o------- connect point, bw scale = 1/2 > + * | .* > + * | . * > + * | . * > + * | . * > + * | . * > + * | . * > + * | . * > + * [--------------------+-----------------------------.--------------------*] > + * 0 bdi setpoint bdi origin limit > + * > + * The bdi control line: if (origin < limit), an auxiliary control line (*) > + * will be setup to extend the main control line (o) to @limit. 
> + */ > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi, > + unsigned long thresh, > + unsigned long dirty, > + unsigned long bdi_thresh, > + unsigned long bdi_dirty) > +{ > + unsigned long limit = hard_dirty_limit(thresh); > + unsigned long origin; > + unsigned long goal; > + unsigned long long span; > + unsigned long long pos_ratio; /* for scaling up/down the rate limit */ > + > + if (unlikely(dirty >= limit)) > + return 0; > + > + /* > + * global setpoint > + */ > + goal = thresh - thresh / DIRTY_SCOPE; > + origin = 4 * thresh; > + > + if (unlikely(origin < limit && dirty > (goal + origin) / 2)) { > + origin = limit; /* auxiliary control line */ > + goal = (goal + origin) / 2; > + pos_ratio >>= 1; > + } > + pos_ratio = origin - dirty; > + pos_ratio <<= BANDWIDTH_CALC_SHIFT; > + do_div(pos_ratio, origin - goal + 1); > + > + /* > + * bdi setpoint > + */ > + if (unlikely(bdi_thresh > thresh)) > + bdi_thresh = thresh; > + goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE; > + /* > + * Use span=(4*bw) in single disk case and transit to bdi_thresh in > + * JBOD case. For JBOD, bdi_thresh could fluctuate up to its own size. > + * Otherwise the bdi write bandwidth is good for limiting the floating > + * area, which makes the bdi control line a good backup when the global > + * control line is too flat/weak in large memory systems. 
> + */ > + span = (u64) bdi_thresh * (thresh - bdi_thresh) + > + (4 * bdi->avg_write_bandwidth) * bdi_thresh; > + do_div(span, thresh + 1); > + origin = goal + 2 * span; > + > + if (unlikely(bdi_dirty > goal + span)) { > + if (bdi_dirty > limit) > + return 0; > + if (origin < limit) { > + origin = limit; /* auxiliary control line */ > + goal += span; > + pos_ratio >>= 1; > + } > + } > + pos_ratio *= origin - bdi_dirty; > + do_div(pos_ratio, origin - goal + 1); > + > + return pos_ratio; > +} > + > static void bdi_update_write_bandwidth(struct backing_dev_info *bdi, > unsigned long elapsed, > unsigned long written) > ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 2/5] writeback: dirty position control 2011-08-09 2:08 ` Vivek Goyal @ 2011-08-16 8:59 ` Wu Fengguang 0 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-16 8:59 UTC (permalink / raw) To: Vivek Goyal Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML > > bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so > > that the resulted task rate limit can drive the dirty pages back to the > > global/bdi setpoints. > > > > IMHO, "position_ratio" is not necessarily very intuitive. Can there be > a better name? Based on your slides, it is a scaling factor applied to the > task rate limit depending on how well we are doing in terms of meeting > our goal of dirty limit. Will "dirty_rate_scale_factor" or something like > that make sense and be a little more intuitive? Yeah, position_ratio is some scale factor to the dirty rate, and I added a comment for that. On the other hand, position_ratio does reflect the underlying "position control of dirty pages" logic. So over time it should be reasonably understandable in the other way :) Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* [PATCH 3/5] writeback: dirty rate control 2011-08-06 8:44 ` Wu Fengguang (?) @ 2011-08-06 8:44 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-06 8:44 UTC (permalink / raw) To: linux-fsdevel Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML [-- Attachment #1: dirty-ratelimit --] [-- Type: text/plain, Size: 6415 bytes --] It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N) when there are N dd tasks. On write() syscall, use bdi->dirty_ratelimit ============================================ balance_dirty_pages(pages_dirtied) { pos_bw = bdi->dirty_ratelimit * bdi_position_ratio(); pause = pages_dirtied / pos_bw; sleep(pause); } On every 200ms, update bdi->dirty_ratelimit =========================================== bdi_update_dirty_ratelimit() { bw = bdi->dirty_ratelimit; ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw; if (dirty pages unbalanced) bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4; } Estimation of balanced bdi->dirty_ratelimit =========================================== When N dd tasks are started, throttle each dd at task_ratelimit = pos_bw (any non-zero initial value is OK) After 200ms, we get dirty_bw = # of pages dirtied by app / 200ms write_bw = # of pages written to disk / 200ms For aggressive dirtiers, the equality holds dirty_bw == N * task_ratelimit == N * pos_bw (1) The balanced throttle bandwidth can be estimated by ref_bw = pos_bw * write_bw / dirty_bw (2) From (1) and (2), we get the equality ref_bw == write_bw / N (3) If the N dd's are all throttled at ref_bw, the dirty/writeback rates will match. So ref_bw is the balanced dirty rate. In practice, the ref_bw calculated by (2) may fluctuate and have estimation errors. 
So the bdi->dirty_ratelimit update policy is to follow it only when both pos_bw and ref_bw point to the same direction (indicating not only the dirty position has deviated from the global/bdi setpoints, but also it's still departing away). Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- include/linux/backing-dev.h | 7 +++ mm/backing-dev.c | 1 mm/page-writeback.c | 69 +++++++++++++++++++++++++++++++++- 3 files changed, 75 insertions(+), 2 deletions(-) --- linux-next.orig/include/linux/backing-dev.h 2011-08-05 18:05:36.000000000 +0800 +++ linux-next/include/linux/backing-dev.h 2011-08-05 18:05:36.000000000 +0800 @@ -75,10 +75,17 @@ struct backing_dev_info { struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS]; unsigned long bw_time_stamp; /* last time write bw is updated */ + unsigned long dirtied_stamp; unsigned long written_stamp; /* pages written at bw_time_stamp */ unsigned long write_bandwidth; /* the estimated write bandwidth */ unsigned long avg_write_bandwidth; /* further smoothed write bw */ + /* + * The base throttle bandwidth, re-calculated on every 200ms. + * All the bdi tasks' dirty rate will be curbed under it. + */ + unsigned long dirty_ratelimit; + struct prop_local_percpu completions; int dirty_exceeded; --- linux-next.orig/mm/backing-dev.c 2011-08-05 18:05:36.000000000 +0800 +++ linux-next/mm/backing-dev.c 2011-08-05 18:05:36.000000000 +0800 @@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd bdi->bw_time_stamp = jiffies; bdi->written_stamp = 0; + bdi->dirty_ratelimit = INIT_BW; bdi->write_bandwidth = INIT_BW; bdi->avg_write_bandwidth = INIT_BW; --- linux-next.orig/mm/page-writeback.c 2011-08-05 18:05:36.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-08-06 09:08:35.000000000 +0800 @@ -736,6 +736,66 @@ static void global_update_bandwidth(unsi spin_unlock(&dirty_lock); } +/* + * Maintain bdi->dirty_ratelimit, the base throttle bandwidth. + * + * Normal bdi tasks will be curbed at or below it in long term. 
+ * Obviously it should be around (write_bw / N) when there are N dd tasks. + */ +static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi, + unsigned long thresh, + unsigned long dirty, + unsigned long bdi_thresh, + unsigned long bdi_dirty, + unsigned long dirtied, + unsigned long elapsed) +{ + unsigned long bw = bdi->dirty_ratelimit; + unsigned long dirty_bw; + unsigned long pos_bw; + unsigned long ref_bw; + unsigned long long pos_ratio; + + /* + * The dirty rate will match the writeback rate in long term, except + * when dirty pages are truncated by userspace or re-dirtied by FS. + */ + dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed; + + pos_ratio = bdi_position_ratio(bdi, thresh, dirty, + bdi_thresh, bdi_dirty); + /* + * pos_bw reflects each dd's dirty rate enforced for the past 200ms. + */ + pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; + pos_bw++; /* this avoids bdi->dirty_ratelimit get stuck in 0 */ + + /* + * ref_bw = pos_bw * write_bw / dirty_bw + * + * It's a linear estimation of the "balanced" throttle bandwidth. + */ + pos_ratio *= bdi->avg_write_bandwidth; + do_div(pos_ratio, dirty_bw | 1); + ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; + + /* + * dirty_ratelimit will follow ref_bw/pos_bw conservatively iff they + * are on the same side of dirty_ratelimit. Which not only makes it + * more stable, but also is essential for preventing it being driven + * away by possible systematic errors in ref_bw. 
+ */ + if (pos_bw < bw) { + if (ref_bw < bw) + bw = max(ref_bw, pos_bw); + } else { + if (ref_bw > bw) + bw = min(ref_bw, pos_bw); + } + + bdi->dirty_ratelimit = bw; +} + void __bdi_update_bandwidth(struct backing_dev_info *bdi, unsigned long thresh, unsigned long dirty, @@ -745,6 +805,7 @@ void __bdi_update_bandwidth(struct backi { unsigned long now = jiffies; unsigned long elapsed = now - bdi->bw_time_stamp; + unsigned long dirtied; unsigned long written; /* @@ -753,6 +814,7 @@ void __bdi_update_bandwidth(struct backi if (elapsed < BANDWIDTH_INTERVAL) return; + dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]); written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]); /* @@ -762,12 +824,15 @@ void __bdi_update_bandwidth(struct backi if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time)) goto snapshot; - if (thresh) + if (thresh) { global_update_bandwidth(thresh, dirty, now); - + bdi_update_dirty_ratelimit(bdi, thresh, dirty, bdi_thresh, + bdi_dirty, dirtied, elapsed); + } bdi_update_write_bandwidth(bdi, elapsed, written); snapshot: + bdi->dirtied_stamp = dirtied; bdi->written_stamp = written; bdi->bw_time_stamp = now; } ^ permalink raw reply [flat|nested] 301+ messages in thread
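The same-side rule at the heart of bdi_update_dirty_ratelimit() can be pulled out into a stand-alone step function to study its behaviour. The helper below mirrors the if/else ladder from the patch above; the free-standing signature is ours, for illustration only.

```c
#include <assert.h>

/*
 * One update step for the base ratelimit `bw`, mirroring the logic in
 * bdi_update_dirty_ratelimit() from the patch above: follow pos_bw/ref_bw
 * only when both lie on the same side of bw, and then only as far as the
 * nearer of the two. The stand-alone signature is illustrative.
 */
static unsigned long ratelimit_step(unsigned long bw,
				    unsigned long pos_bw,
				    unsigned long ref_bw)
{
	if (pos_bw < bw) {
		/* both estimates below bw: safe to lower, but not past either */
		if (ref_bw < bw)
			bw = ref_bw > pos_bw ? ref_bw : pos_bw;	/* max() */
	} else {
		/* both estimates above bw: safe to raise, but not past either */
		if (ref_bw > bw)
			bw = ref_bw < pos_bw ? ref_bw : pos_bw;	/* min() */
	}
	return bw;
}
```

When the two estimates disagree — say pos_bw says "too fast" while a noisy ref_bw says "too slow" — bw is simply held, which is what keeps systematic errors in ref_bw from dragging the ratelimit away.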
* Re: [PATCH 3/5] writeback: dirty rate control 2011-08-06 8:44 ` Wu Fengguang @ 2011-08-09 14:54 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-09 14:54 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Sat, Aug 06, 2011 at 04:44:50PM +0800, Wu Fengguang wrote: > It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N) > when there are N dd tasks. > > On write() syscall, use bdi->dirty_ratelimit > ============================================ > > balance_dirty_pages(pages_dirtied) > { > pos_bw = bdi->dirty_ratelimit * bdi_position_ratio(); > pause = pages_dirtied / pos_bw; > sleep(pause); > } > > On every 200ms, update bdi->dirty_ratelimit > =========================================== > > bdi_update_dirty_ratelimit() > { > bw = bdi->dirty_ratelimit; > ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw; > if (dirty pages unbalanced) > bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4; > } > > Estimation of balanced bdi->dirty_ratelimit > =========================================== > > When started N dd, throttle each dd at > > task_ratelimit = pos_bw (any non-zero initial value is OK) > > After 200ms, we got > > dirty_bw = # of pages dirtied by app / 200ms > write_bw = # of pages written to disk / 200ms > > For aggressive dirtiers, the equality holds > > dirty_bw == N * task_ratelimit > == N * pos_bw (1) > > The balanced throttle bandwidth can be estimated by > > ref_bw = pos_bw * write_bw / dirty_bw (2) > > >From (1) and (2), we get equality > > ref_bw == write_bw / N (3) > > If the N dd's are all throttled at ref_bw, the dirty/writeback rates > will match. So ref_bw is the balanced dirty rate. Hi Fengguang, So how much work it is to extend all this to handle the case of cgroups? IOW, I would imagine that you shall have to keep track of per cgroup/per bdi state of many of the variables. 
For example, write_bw will become a per-cgroup/per-bdi entity instead of a per-bdi entity only. The same should be true for position ratio, dirty_bw, etc.? I am assuming that if some cgroup has a low weight on the end device, then that cgroup's WRITE bandwidth should go down, that should be accounted for in the per-bdi state, and task throttling should happen accordingly, so that tasks in a lower-weight cgroup get throttled more than tasks in a higher-weight cgroup? Thanks Vivek ^ permalink raw reply [flat|nested] 301+ messages in thread
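The derivation quoted at the top of this message ends in equality (3), ref_bw == write_bw / N. That step can be checked mechanically: whenever dirty_bw == N * pos_bw (the aggressive-dirtier case), equation (2) collapses to write_bw / N for any non-zero initial pos_bw. A minimal stand-alone check follows; the function name is ours, not from the patches.

```c
#include <assert.h>

/* ref_bw = pos_bw * write_bw / dirty_bw -- equation (2) from the patch text */
static unsigned long ref_bw(unsigned long pos_bw,
			    unsigned long write_bw,
			    unsigned long dirty_bw)
{
	return pos_bw * write_bw / (dirty_bw ? dirty_bw : 1);
}
```

With N = 4 dd tasks and write_bw = 25600 pages/s, ref_bw lands on write_bw / N = 6400 pages/s regardless of whether each task started out throttled at 1000 or at 7 pages/s — which is why any non-zero initial value works.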
* Re: [PATCH 3/5] writeback: dirty rate control 2011-08-09 14:54 ` Vivek Goyal @ 2011-08-11 3:42 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-11 3:42 UTC (permalink / raw) To: Vivek Goyal Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Tue, Aug 09, 2011 at 10:54:38PM +0800, Vivek Goyal wrote: > On Sat, Aug 06, 2011 at 04:44:50PM +0800, Wu Fengguang wrote: > > It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N) > > when there are N dd tasks. > > > > On write() syscall, use bdi->dirty_ratelimit > > ============================================ > > > > balance_dirty_pages(pages_dirtied) > > { > > pos_bw = bdi->dirty_ratelimit * bdi_position_ratio(); > > pause = pages_dirtied / pos_bw; > > sleep(pause); > > } > > > > On every 200ms, update bdi->dirty_ratelimit > > =========================================== > > > > bdi_update_dirty_ratelimit() > > { > > bw = bdi->dirty_ratelimit; > > ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw; > > if (dirty pages unbalanced) > > bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4; > > } > > > > Estimation of balanced bdi->dirty_ratelimit > > =========================================== > > > > When started N dd, throttle each dd at > > > > task_ratelimit = pos_bw (any non-zero initial value is OK) > > > > After 200ms, we got > > > > dirty_bw = # of pages dirtied by app / 200ms > > write_bw = # of pages written to disk / 200ms > > > > For aggressive dirtiers, the equality holds > > > > dirty_bw == N * task_ratelimit > > == N * pos_bw (1) > > > > The balanced throttle bandwidth can be estimated by > > > > ref_bw = pos_bw * write_bw / dirty_bw (2) > > > > >From (1) and (2), we get equality > > > > ref_bw == write_bw / N (3) > > > > If the N dd's are all throttled at ref_bw, the dirty/writeback rates > > will match. So ref_bw is the balanced dirty rate. 
> > Hi Fengguang, Hi Vivek, > So how much work it is to extend all this to handle the case of cgroups? Here is the simplest form. writeback: async write IO controllers http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=blobdiff;f=mm/page-writeback.c;h=0b579e7fd338fd1f59cc36bf15fda06ff6260634;hp=34dff9f0d28d0f4f0794eb41187f71b4ade6b8a2;hb=1a58ad99ce1f6a9df6618a4b92fa4859cc3e7e90;hpb=5b6fcb3125ea52ff04a2fad27a51307842deb1a0 And an old email on this topic: https://lkml.org/lkml/2011/4/28/229 > IOW, I would imagine that you shall have to keep track of per cgroup/per > bdi state of many of the variables. For example, write_bw will become > per cgroup/per bdi entity instead of per bdi entity only. Same should > be true for position ratio, dirty_bw etc? The dirty_bw, write_bw and dirty_ratelimit should be replicated, but not necessarily dirty pages and position ratio. The cgroup can just rely on the root cgroup's dirty pages position control if it does not care about its own dirty pages consumptions. > I am assuming that if some cgroup is low weight on end device, then > WRITE bandwidth of that cgroup should go down and that should be > accounted for at per bdi state and task throttling should happen > accordingly so that a lower weight cgroup tasks get throttled more > as compared to higher weight cgroup tasks? Sorry I don't quite catch your meaning, but the current ->dirty_ratelimit adaptation scheme (detailed in another email) should handle all such rate/bw allocation issues automatically? Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
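A rough sketch of what the replication described in this reply could look like — the rate-tracking fields duplicated per cgroup per bdi, with dirty-position control left to the root cgroup. The struct, helper, and all names below are purely hypothetical illustrations of the idea, not code from any posted patch.

```c
#include <assert.h>

/*
 * Hypothetical per-cgroup, per-bdi writeback state sketching the reply
 * above: dirty_bw, write_bw and dirty_ratelimit are replicated, while
 * position control can stay global. Every name here is invented.
 */
struct cgwb_rate_state {
	unsigned long dirty_bw;		/* pages/s dirtied by this cgroup */
	unsigned long write_bw;		/* pages/s written back for this cgroup */
	unsigned long dirty_ratelimit;	/* per-cgroup base throttle bandwidth */
	unsigned long dirtied_stamp;	/* dirtied-pages snapshot at last update */
	unsigned long bw_time_stamp;	/* time of the last 200ms update */
};

/* per-cgroup dirty rate over an interval, analogous to the global dirty_bw */
static unsigned long cgwb_dirty_bw(const struct cgwb_rate_state *s,
				   unsigned long dirtied_now,
				   unsigned long elapsed_ms)
{
	if (elapsed_ms == 0)
		elapsed_ms = 1;
	return (dirtied_now - s->dirtied_stamp) * 1000 / elapsed_ms;
}
```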
* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-06  8:44 ` Wu Fengguang
@ 2011-08-09 14:57 ` Peter Zijlstra
  0 siblings, 0 replies; 301+ messages in thread
From: Peter Zijlstra @ 2011-08-09 14:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
      linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
>
> Estimation of balanced bdi->dirty_ratelimit
> ===========================================
>
> When started N dd, throttle each dd at
>
>         task_ratelimit = pos_bw (any non-zero initial value is OK)

This is (0), since it makes (1). But it fails to explain what the
difference is between task_ratelimit and pos_bw (and why positional
bandwidth is a good name).

> After 200ms, we got
>
>         dirty_bw = # of pages dirtied by app / 200ms
>         write_bw = # of pages written to disk / 200ms

Right, so that I get. And our premise for the whole work is to delay
applications so that we match the dirty_bw to the write_bw, right?

> For aggressive dirtiers, the equality holds
>
>         dirty_bw == N * task_ratelimit
>                  == N * pos_bw                       (1)

So dirty_bw is in pages/s, so task_ratelimit should also be in pages/s,
since N is a unit-less number.

What does task_ratelimit in pages/s mean? Since we make the tasks sleep
the only thing we can make from this is a measure of pages. So I expect
(in a later patch) we compute the sleep time on the amount of pages we
want written out, using this ratelimit measure, right?

> The balanced throttle bandwidth can be estimated by
>
>         ref_bw = pos_bw * write_bw / dirty_bw        (2)

Here you introduce reference bandwidth, what does it mean and what is
its relation to positional bandwidth. Going by the equation, we got
(pages/s * pages/s) / (pages/s) so we indeed have a bandwidth unit.

write_bw/dirty_bw is the ratio between output and input of dirty pages,
but what is pos_bw and what does that make ref_bw?

> From (1) and (2), we get equality
>
>         ref_bw == write_bw / N                       (3)

Somehow this seems like the primary postulate, yet you present it like a
derivation. The whole purpose of your control system is to provide this
fairness between processes, therefore I would expect you start out with
this postulate and reason therefrom.

> If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> will match. So ref_bw is the balanced dirty rate.

Which does lead to the question why it's not called that instead ;-)

> In practice, the ref_bw calculated by (2) may fluctuate and have
> estimation errors. So the bdi->dirty_ratelimit update policy is to
> follow it only when both pos_bw and ref_bw point to the same direction
> (indicating not only the dirty position has deviated from the global/bdi
> setpoints, but also it's still departing away).

Which is where you introduce the need for pos_bw, yet you have not yet
explained its meaning. In this explanation you allude to it being the
speed (first time derivative) of the deviation from the setpoint.

The set point's measure is in pages, so the measure of its first time
derivative would indeed be pages/s, just like bandwidth, but calling it
a bandwidth seems highly confusing indeed.

I would also like a few more words on your update condition, why did you
pick those, and what are the full ramifications of them.

Also missing in this story is your pos_ratio thing, it is used in the
code, but there is no explanation on how it ties in with the above
things.

You seem very skilled in control systems (your earlier read-ahead work
was also a very complex system), but the explanations of your systems
are highly confusing. Can you go back to the roots and explain how you
constructed your model and why you did so? (without using graphs please)

PS. I'm not criticizing your work, the results are impressive (as
always), but I find it very hard to understand.

PPS. If it would help, feel free to refer me to educational material on
control system theory, either online or in books.

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 14:57 ` Peter Zijlstra
@ 2011-08-10 11:07 ` Wu Fengguang
  0 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-10 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
      linux-mm, LKML

On Tue, Aug 09, 2011 at 10:57:32PM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> >
> > Estimation of balanced bdi->dirty_ratelimit
> > ===========================================
> >
> > When started N dd, throttle each dd at
> >
> >         task_ratelimit = pos_bw (any non-zero initial value is OK)
>
> This is (0), since it makes (1). But it fails to explain what the
> difference is between task_ratelimit and pos_bw (and why positional
> bandwidth is a good name).

Yeah it's (0) and is another form of the formula used in
balance_dirty_pages():

        rate = bdi->dirty_ratelimit * pos_ratio

In fact the estimation of ref_bw can take a more general form, by
writing (0) as

        task_ratelimit = task_ratelimit_0

where task_ratelimit_0 is any non-zero value balance_dirty_pages()
uses to throttle the tasks during that 200ms.

> > After 200ms, we got
> >
> >         dirty_bw = # of pages dirtied by app / 200ms
> >         write_bw = # of pages written to disk / 200ms
>
> Right, so that I get. And our premise for the whole work is to delay
> applications so that we match the dirty_bw to the write_bw, right?

Right, the balance target is (dirty_bw == write_bw), but let's rename
dirty_bw to dirty_rate as you suggested.

> > For aggressive dirtiers, the equality holds
> >
> >         dirty_bw == N * task_ratelimit
> >                  == N * pos_bw                       (1)
>
> So dirty_bw is in pages/s, so task_ratelimit should also be in pages/s,
> since N is a unit-less number.

Right.

> What does task_ratelimit in pages/s mean? Since we make the tasks sleep
> the only thing we can make from this is a measure of pages. So I expect
> (in a later patch) we compute the sleep time on the amount of pages we
> want written out, using this ratelimit measure, right?

Right. balance_dirty_pages() will use it this way (the variable name
used in code is 'bw', will change to 'rate'):

        pause = (HZ * pages_dirtied) / task_ratelimit

> > The balanced throttle bandwidth can be estimated by
> >
> >         ref_bw = pos_bw * write_bw / dirty_bw        (2)
>
> Here you introduce reference bandwidth, what does it mean and what is
> its relation to positional bandwidth. Going by the equation, we got
> (pages/s * pages/s) / (pages/s) so we indeed have a bandwidth unit.

Yeah. Or better do some renames:

        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)   (2)

> write_bw/dirty_bw is the ratio between output and input of dirty pages,
> but what is pos_bw and what does that make ref_bw?

It's (bdi->dirty_ratelimit * pos_ratio), the effective dirty rate
balance_dirty_pages() used to limit each bdi task for the past 200ms.

For example, if (task_ratelimit_0 = write_bw), then the N dd tasks
will make bdi dirty rate (dirty_rate = N * task_ratelimit_0), and the
balanced ratelimit will be

        balanced_rate = task_ratelimit_0 * (write_bw / (N * task_ratelimit_0))
                      = write_bw / N

Thus within 200ms, we get the estimation of balanced_rate without
knowing N beforehand.

> > From (1) and (2), we get equality
> >
> >         ref_bw == write_bw / N                       (3)
>
> Somehow this seems like the primary postulate, yet you present it like a
> derivation. The whole purpose of your control system is to provide this
> fairness between processes, therefore I would expect you start out with
> this postulate and reason therefrom.

Good idea.

> > If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> > will match. So ref_bw is the balanced dirty rate.
>
> Which does lead to the question why it's not called that instead ;-)

Sure, changed to balanced_rate :-)

> > In practice, the ref_bw calculated by (2) may fluctuate and have
> > estimation errors. So the bdi->dirty_ratelimit update policy is to
> > follow it only when both pos_bw and ref_bw point to the same direction
> > (indicating not only the dirty position has deviated from the global/bdi
> > setpoints, but also it's still departing away).
>
> Which is where you introduce the need for pos_bw, yet you have not yet
> explained its meaning. In this explanation you allude to it being the
> speed (first time derivative) of the deviation from the setpoint.

That's right.

> The set point's measure is in pages, so the measure of its first time
> derivative would indeed be pages/s, just like bandwidth, but calling it
> a bandwidth seems highly confusing indeed.

Yeah, I'll rename the relevant vars *bw to *rate.

> I would also like a few more words on your update condition, why did you
> pick those, and what are the full ramifications of them.

OK.

> Also missing in this story is your pos_ratio thing, it is used in the
> code, but there is no explanation on how it ties in with the above
> things.

There are two control targets

(1) dirty setpoint
(2) dirty rate

pos_ratio does the position based control for (1). It's not inherently
relevant to the computation of balanced_rate. I hope the below
rephrased text will make it easier to understand.

: When started N dd, we would like to throttle each dd at
:
:         balanced_rate == write_bw / N                                (1)
:
: We don't know N beforehand, but still can estimate balanced_rate
: within 200ms.
:
: Start by throttling each dd task at rate
:
:         task_ratelimit = task_ratelimit_0                            (2)
:                          (any non-zero initial value is OK)
:
: After 200ms, we got
:
:         dirty_rate = # of pages dirtied by all dd's / 200ms
:         write_bw   = # of pages written to the disk / 200ms
:
: For the aggressive dd dirtiers, the equality holds
:
:         dirty_rate == N * task_rate
:                    == N * task_ratelimit
:                    == N * task_ratelimit_0                           (3)
: Or
:         task_ratelimit_0 = dirty_rate / N                            (4)
:
: So the balanced throttle bandwidth can be estimated by
:
:         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)   (5)
:
: Because with (4) and (5) we can get the desired equality (1):
:
:         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
:                       == write_bw / N
:
: Since balance_dirty_pages() will be using
:
:         task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio() (6)
:
: Taking (5) and (6), we get the real formula used in the code
:
:         balanced_rate = bdi->dirty_ratelimit * bdi_position_ratio() *
:                         (write_bw / dirty_rate)                      (7)

> You seem very skilled in control systems (your earlier read-ahead work
> was also a very complex system),

Thank you! I majored in college in "Pattern Recognition and
Intelligent Systems" and "Control Theory and Control Engineering",
which happen to be the perfect preparations for read-ahead and dirty
balancing :)

> but the explanations of your systems are highly confusing.

Sorry for that!

> Can you go back to the roots and explain how you constructed your
> model and why you did so? (without using graphs please)

As mentioned above, the root requirements are

(1) position target: to keep dirty pages around the bdi/global setpoints
(2) rate target: to keep bdi dirty rate around bdi write bandwidth

In order to meet (2), we try to estimate (balanced_rate = write_bw / N)
and use it to throttle the N dd tasks.

However that's not enough. When the dirty rate perfectly matches the
write bandwidth, the dirty pages can stay stationary at any point. We
want the dirty pages to stay around the setpoints as required by (1).

So if the dirty pages are ABOVE the setpoints, we throttle each task
a bit more HEAVILY than balanced_rate, so that the dirty pages are
created less fast than they are cleaned, thus DROP to the setpoints
(and the reverse). With that positional adjustment, the formula is
transformed from

        task_ratelimit = balanced_rate              => meets (2)

to

        task_ratelimit = balanced_rate * pos_ratio  => meets both (1),(2)

Finally, due to the possible large fluctuations in the raw
balanced_rate value, the more stable bdi->dirty_ratelimit which tracks
balanced_rate in a conservative way is used, resulting in the final
form

        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio()

> PS. I'm not criticizing your work, the results are impressive (as
> always), but I find it very hard to understand.
>
> PPS. If it would help, feel free to refer me to educational material on
> control system theory, either online or in books.

Fortunately no fancy control theory is used here ;) Only the simple
theory of negative feedback control is used, which states that there
will be overshoots and ringing if trying to correct the errors way too
fast. The overshooting concept can be explained in the graph of the
below page, where the step response can be a sudden start of a dd
reader that takes away all the disk write bandwidth.

http://en.wikipedia.org/wiki/Step_response

In terms of the negative feedback control theory, the
bdi_position_ratio() function (control lines) can be expressed as

1) f(setpoint) = 1.0

2) df/dt < 0

3) optionally, abs(df/dt) should be large on large errors
   (= dirty - setpoint) in order to cancel the errors fast, and be
   smaller when dirty pages get closer to the setpoints in order to
   avoid overshooting.

The principle of (3) will be implemented in some follow up patches :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
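The control-line properties Wu lists for bdi_position_ratio() can be illustrated with a minimal sketch (an assumption for illustration, not the kernel's actual function): a straight line through (setpoint, 1.0) that falls to 0.0 at the dirty limit.

```python
# Hypothetical linear control line: pos_ratio is 1.0 at the setpoint
# and decreases as dirty grows, so tasks above the setpoint get
# throttled harder than balanced_rate and dirty pages drift back down
# (and the reverse below the setpoint).

def pos_ratio(dirty, setpoint, limit):
    # negative slope satisfies property (2); clamp at 0 near the limit
    return max(0.0, 1.0 - (dirty - setpoint) / (limit - setpoint))

setpoint, limit = 1000, 1600
assert pos_ratio(setpoint, setpoint, limit) == 1.0   # property (1)
assert pos_ratio(1300, setpoint, limit) < 1.0        # above: throttle more
assert pos_ratio(700, setpoint, limit) > 1.0         # below: throttle less
```

A real control line would also implement property (3) — a steeper slope far from the setpoint, flattening near it — which this linear sketch deliberately omits.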
* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-10 11:07 ` Wu Fengguang
@ 2011-08-10 16:17 ` Peter Zijlstra
  0 siblings, 0 replies; 301+ messages in thread
From: Peter Zijlstra @ 2011-08-10 16:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
      linux-mm, LKML

How about something like the below? It still needs some more work, but
it's more or less complete in that it now explains both controls in one
story. The actual update bit is still missing.

---

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit
in order to avoid memory deadlocks. Furthermore we desire fairness in
that tasks get throttled proportionally to the amount of pages they
dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth; this yields a stable amount of dirty pages:

        ratelimit = writeout_bandwidth

The fairness requirement gives us:

        task_ratelimit = write_bandwidth / N

> : When N dd tasks are started, we would like to throttle each dd at
> :
> :         balanced_rate == write_bw / N                             (1)
> :
> : We don't know N beforehand, but can still estimate balanced_rate
> : within 200ms.
> :
> : Start by throttling each dd task at rate
> :
> :         task_ratelimit = task_ratelimit_0                         (2)
> :         (any non-zero initial value is OK)
> :
> : After 200ms, we get
> :
> :         dirty_rate = # of pages dirtied by all dd's / 200ms
> :         write_bw   = # of pages written to the disk / 200ms
> :
> : For the aggressive dd dirtiers, the equality holds
> :
> :         dirty_rate == N * task_rate
> :                    == N * task_ratelimit
> :                    == N * task_ratelimit_0                        (3)
> : Or
> :         task_ratelimit_0 = dirty_rate / N                         (4)
> :
> : So the balanced throttle bandwidth can be estimated by
> :
> :         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate) (5)
> :
> : Because with (4) and (5) we can get the desired equality (1):
> :
> :         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> :                       == write_bw / N

Then using the balanced_rate we can compute task pause times like:

        task_pause = task->nr_dirtied / task_ratelimit

[ however all that still misses the primary feedback of:

        task_ratelimit_(i+1) = task_ratelimit_i * (write_bw / dirty_rate)

  there's still some confusion in the above due to task_ratelimit and
  balanced_rate. ]

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit, we need another control.

> So if the dirty pages are ABOVE the setpoints, we throttle each task
> a bit more HEAVILY than balanced_rate, so that the dirty pages are
> created more slowly than they are cleaned, thus DROP back to the
> setpoints (and the reverse). With that positional adjustment, the
> formula is transformed from
>
>         task_ratelimit = balanced_rate
>
> to
>
>         task_ratelimit = balanced_rate * pos_ratio

> In terms of negative feedback control theory, the
> bdi_position_ratio() function (the control lines) can be expressed as
>
> 1) f(setpoint) = 1.0
> 2) df/dx < 0
>
> 3) optionally, abs(df/dx) should be large on large errors (= dirty -
> setpoint) in order to cancel the errors fast, and be smaller when
> dirty pages get closer to the setpoints in order to avoid overshooting.
* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-10 16:17 ` Peter Zijlstra
@ 2011-08-15 14:08 ` Wu Fengguang
  0 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-15 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
      linux-mm, LKML

On Thu, Aug 11, 2011 at 12:17:55AM +0800, Peter Zijlstra wrote:
> How about something like the below? It still needs some more work, but
> it's more or less complete in that it now explains both controls in one
> story. The actual update bit is still missing.

Looks pretty good, thanks! I'll post the completed version at the bottom.

> ---
>
> balance_dirty_pages() needs to throttle tasks dirtying pages such that
> the total amount of dirty pages stays below the specified dirty limit
> in order to avoid memory deadlocks. Furthermore we desire fairness in
> that tasks get throttled proportionally to the amount of pages they
> dirty.
>
> IOW we want to throttle tasks such that we match the dirty rate to the
> writeout bandwidth; this yields a stable amount of dirty pages:
>
>         ratelimit = writeout_bandwidth
>
> The fairness requirement gives us:
>
>         task_ratelimit = write_bandwidth / N
>
> > : When N dd tasks are started, we would like to throttle each dd at
> > :
> > :         balanced_rate == write_bw / N                            (1)
> > :
> > : We don't know N beforehand, but can still estimate balanced_rate
> > : within 200ms.
> > :
> > : Start by throttling each dd task at rate
> > :
> > :         task_ratelimit = task_ratelimit_0                        (2)
> > :         (any non-zero initial value is OK)
> > :
> > : After 200ms, we get
> > :
> > :         dirty_rate = # of pages dirtied by all dd's / 200ms
> > :         write_bw   = # of pages written to the disk / 200ms
> > :
> > : For the aggressive dd dirtiers, the equality holds
> > :
> > :         dirty_rate == N * task_rate
> > :                    == N * task_ratelimit
> > :                    == N * task_ratelimit_0                       (3)
> > : Or
> > :         task_ratelimit_0 = dirty_rate / N                        (4)
> > :
> > : So the balanced throttle bandwidth can be estimated by
> > :
> > :         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate) (5)
> > :
> > : Because with (4) and (5) we can get the desired equality (1):
> > :
> > :         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > :                       == write_bw / N
>
> Then using the balanced_rate we can compute task pause times like:
>
>         task_pause = task->nr_dirtied / task_ratelimit
>
> [ however all that still misses the primary feedback of:
>
>         task_ratelimit_(i+1) = task_ratelimit_i * (write_bw / dirty_rate)
>
>   there's still some confusion in the above due to task_ratelimit and
>   balanced_rate. ]
>
> However, while the above gives us means of matching the dirty rate to
> the writeout bandwidth, it at best provides us with a stable dirty page
> count (assuming a static system). In order to control the dirty page
> count such that it is high enough to provide performance, but does not
> exceed the specified limit, we need another control.
>
> > So if the dirty pages are ABOVE the setpoints, we throttle each task
> > a bit more HEAVILY than balanced_rate, so that the dirty pages are
> > created more slowly than they are cleaned, thus DROP back to the
> > setpoints (and the reverse). With that positional adjustment, the
> > formula is transformed from
> >
> >         task_ratelimit = balanced_rate
> >
> > to
> >
> >         task_ratelimit = balanced_rate * pos_ratio
>
> > In terms of negative feedback control theory, the
> > bdi_position_ratio() function (the control lines) can be expressed as
> >
> > 1) f(setpoint) = 1.0
> > 2) df/dx < 0
> >
> > 3) optionally, abs(df/dx) should be large on large errors (= dirty -
> > setpoint) in order to cancel the errors fast, and be smaller when
> > dirty pages get closer to the setpoints in order to avoid overshooting.

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit
in order to avoid memory deadlocks. Furthermore we desire fairness in
that tasks get throttled proportionally to the amount of pages they
dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth; this yields a stable amount of dirty pages:

        ratelimit = write_bw                                        (1)

The fairness requirement gives us:

        task_ratelimit = write_bw / N                               (2)

where N is the number of dd tasks. We don't know N beforehand, but can
still estimate the balanced task_ratelimit within 200ms.
Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0                           (3)
        (any non-zero initial value is OK)

After 200ms, we measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

        dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0                          (4)
Or
        task_ratelimit_0 = dirty_rate / N                           (5)

Now we conclude that the balanced task ratelimit can be estimated by

        task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate) (6)

Because with (4) and (5) we can get the desired equality (1):

        task_ratelimit == (dirty_rate / N) * (write_bw / dirty_rate)
                       == write_bw / N

Then using the balanced task ratelimit we can compute task pause times
like:

        task_pause = task->nr_dirtied / task_ratelimit

task_ratelimit with position control
------------------------------------

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit, we need another control.

The dirty position control works by splitting (6) into

        task_ratelimit = balanced_rate                              (7)
        balanced_rate  = task_ratelimit_0 * (write_bw / dirty_rate) (8)

and extending (7) to

        task_ratelimit = balanced_rate * pos_ratio                  (9)

where pos_ratio is a negative feedback function subject to

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVILY than balanced_rate, so that the dirty pages are
created more slowly than they are cleaned and thus DROP back to the
setpoint (and the reverse).

bdi->dirty_ratelimit update policy
----------------------------------

The balanced_rate calculated by (8) is not suitable for direct use (*).
For the reasons listed below, (9) is further transformed into

        task_ratelimit = dirty_ratelimit * pos_ratio               (10)

where dirty_ratelimit will track balanced_rate _conservatively_.

---

(*) There are some imperfections in balanced_rate which make it
unsuitable for direct use:

1) large fluctuations

The dirty_rate used for computing balanced_rate is merely averaged over
the past 200ms (very short compared to the 3s estimation period for
write_bw), which makes for a rather dispersed distribution of
balanced_rate.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I give it up totally. (btw, the
3s write_bw averaging time lag is much more acceptable because its
impact is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular balanced_rate
points can be filtered out by remembering some prev_balanced_rate and
prev_prev_balanced_rate. However the more reliable way is to guard
balanced_rate with pos_rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematic
errors in balanced_rate. The truncates, due to their possibly bumpy
nature, can hardly be compensated for smoothly. So let's face it. When
some over-estimated balanced_rate pushes dirty_ratelimit high, dirty
pages will go higher than the setpoint. pos_rate will in turn become
lower than dirty_ratelimit. So if we consider both balanced_rate and
pos_rate and update dirty_ratelimit only when they are on the same side
of dirty_ratelimit, the systematic errors in balanced_rate won't be
able to drag dirty_ratelimit far away.

The balanced_rate estimation may also be inaccurate near the max pause
and free run areas, however that is less of an issue.
3) since we ultimately want to

- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint for as long as possible

the update policy used for (2) also serves the above goals nicely: if
for some reason the dirty pages are high (pos_rate < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_rate), there is
no point in bringing up dirty_ratelimit in a hurry only to hurt both of
the above goals.

In summary, the dirty_ratelimit update policy consists of two
constraints:

1) avoid changing the dirty rate when it's against the position control
target (the adjusted rate will slow down the progress of dirty pages
going back to the setpoint).

2) limit the step size. pos_rate changes value step by step, leaving a
consistent trace compared to the randomly jumping balanced_rate.
pos_rate also has the nice property of smaller errors in the steady
state and typically larger errors when there are big errors in the
rate. So it's a pretty good limiting factor for the step size of
dirty_ratelimit.

Thanks,
Fengguang
* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-06 8:44 ` Wu Fengguang
@ 2011-08-09 15:50 ` Vivek Goyal
  0 siblings, 0 replies; 301+ messages in thread
From: Vivek Goyal @ 2011-08-09 15:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML

On Sat, Aug 06, 2011 at 04:44:50PM +0800, Wu Fengguang wrote:

[..]
> +/*
> + * Maintain bdi->dirty_ratelimit, the base throttle bandwidth.
> + *
> + * Normal bdi tasks will be curbed at or below it in long term.
> + * Obviously it should be around (write_bw / N) when there are N dd tasks.
> + */

Hi Fengguang,

So IIUC, bdi->dirty_ratelimit is the dynamically adjusted desired rate
limit (based on position ratio, dirty_bw and write_bw). But this seems
to be an overall bdi limit and does not seem to take into account the
number of tasks doing IO to that bdi (as your comment suggests). So it
probably will track write_bw as opposed to write_bw/N.

What am I missing?

Thanks
Vivek

> +static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
> +				       unsigned long thresh,
> +				       unsigned long dirty,
> +				       unsigned long bdi_thresh,
> +				       unsigned long bdi_dirty,
> +				       unsigned long dirtied,
> +				       unsigned long elapsed)
> +{
> +	unsigned long bw = bdi->dirty_ratelimit;
> +	unsigned long dirty_bw;
> +	unsigned long pos_bw;
> +	unsigned long ref_bw;
> +	unsigned long long pos_ratio;
> +
> +	/*
> +	 * The dirty rate will match the writeback rate in long term, except
> +	 * when dirty pages are truncated by userspace or re-dirtied by FS.
> +	 */
> +	dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
> +
> +	pos_ratio = bdi_position_ratio(bdi, thresh, dirty,
> +				       bdi_thresh, bdi_dirty);
> +	/*
> +	 * pos_bw reflects each dd's dirty rate enforced for the past 200ms.
> +	 */
> +	pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> +	pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
> +
> +	/*
> +	 * ref_bw = pos_bw * write_bw / dirty_bw
> +	 *
> +	 * It's a linear estimation of the "balanced" throttle bandwidth.
> +	 */
> +	pos_ratio *= bdi->avg_write_bandwidth;
> +	do_div(pos_ratio, dirty_bw | 1);
> +	ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> +
> +	/*
> +	 * dirty_ratelimit will follow ref_bw/pos_bw conservatively iff they
> +	 * are on the same side of dirty_ratelimit. Which not only makes it
> +	 * more stable, but also is essential for preventing it being driven
> +	 * away by possible systematic errors in ref_bw.
> +	 */
> +	if (pos_bw < bw) {
> +		if (ref_bw < bw)
> +			bw = max(ref_bw, pos_bw);
> +	} else {
> +		if (ref_bw > bw)
> +			bw = min(ref_bw, pos_bw);
> +	}
> +
> +	bdi->dirty_ratelimit = bw;
> +}
> +
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
>  			    unsigned long dirty,
> @@ -745,6 +805,7 @@ void __bdi_update_bandwidth(struct backi
>  {
>  	unsigned long now = jiffies;
>  	unsigned long elapsed = now - bdi->bw_time_stamp;
> +	unsigned long dirtied;
>  	unsigned long written;
>
>  	/*
> @@ -753,6 +814,7 @@ void __bdi_update_bandwidth(struct backi
>  	if (elapsed < BANDWIDTH_INTERVAL)
>  		return;
>
> +	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
>  	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
>
>  	/*
> @@ -762,12 +824,15 @@ void __bdi_update_bandwidth(struct backi
>  	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
>  		goto snapshot;
>
> -	if (thresh)
> +	if (thresh) {
>  		global_update_bandwidth(thresh, dirty, now);
> -
> +		bdi_update_dirty_ratelimit(bdi, thresh, dirty, bdi_thresh,
> +					   bdi_dirty, dirtied, elapsed);
> +	}
>  	bdi_update_write_bandwidth(bdi, elapsed, written);
>
>  snapshot:
> +	bdi->dirtied_stamp = dirtied;
>  	bdi->written_stamp = written;
>  	bdi->bw_time_stamp = now;
>  }
* Re: [PATCH 3/5] writeback: dirty rate control 2011-08-09 15:50 ` Vivek Goyal (?) @ 2011-08-09 16:16 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-09 16:16 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote: > > So IIUC, bdi->dirty_ratelimit is the dynamically adjusted desired rate > limit (based on position ratio, dirty_bw and write_bw). But this seems > to be an overall bdi limit and does not seem to take into account the > number of tasks doing IO to that bdi (as your comment suggests). So > it probably will track write_bw as opposed to write_bw/N. What am > I missing? I think the per-task thing comes from him using the pages_dirtied argument to balance_dirty_pages() to compute the sleep time. Although I'm not quite sure how he keeps fairness in light of the sleep time bounding to MAX_PAUSE. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 3/5] writeback: dirty rate control 2011-08-09 16:16 ` Peter Zijlstra (?) @ 2011-08-09 16:19 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-09 16:19 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Tue, 2011-08-09 at 18:16 +0200, Peter Zijlstra wrote: > On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote: > > > > So IIUC, bdi->dirty_ratelimit is the dynamically adjusted desired rate > > limit (based on position ratio, dirty_bw and write_bw). But this seems > > to be an overall bdi limit and does not seem to take into account the > > number of tasks doing IO to that bdi (as your comment suggests). So > > it probably will track write_bw as opposed to write_bw/N. What am > > I missing? > > I think the per-task thing comes from him using the pages_dirtied > argument to balance_dirty_pages() to compute the sleep time. Although > I'm not quite sure how he keeps fairness in light of the sleep time > bounding to MAX_PAUSE. Furthermore, there's of course the issue that current->nr_dirtied is computed over all BDIs it dirtied pages from, and the sleep time is computed for the BDI it happened to do the overflowing write on. Assuming a task (mostly) writes to a single bdi, or equally to all, it should all work out. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 3/5] writeback: dirty rate control 2011-08-09 16:19 ` Peter Zijlstra @ 2011-08-10 14:07 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-10 14:07 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Wed, Aug 10, 2011 at 12:19:32AM +0800, Peter Zijlstra wrote: > On Tue, 2011-08-09 at 18:16 +0200, Peter Zijlstra wrote: > > On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote: > > > > > > So IIUC, bdi->dirty_ratelimit is the dynamically adjusted desired rate > > > limit (based on position ratio, dirty_bw and write_bw). But this seems > > > to be an overall bdi limit and does not seem to take into account the > > > number of tasks doing IO to that bdi (as your comment suggests). So > > > it probably will track write_bw as opposed to write_bw/N. What am > > > I missing? > > > > I think the per-task thing comes from him using the pages_dirtied > > argument to balance_dirty_pages() to compute the sleep time. Although > > I'm not quite sure how he keeps fairness in light of the sleep time > > bounding to MAX_PAUSE. > > Furthermore, there's of course the issue that current->nr_dirtied is > computed over all BDIs it dirtied pages from, and the sleep time is > computed for the BDI it happened to do the overflowing write on. > > Assuming a task (mostly) writes to a single bdi, or equally to all, it > should all work out. Right. That's one pitfall I forgot to mention, sorry. If _really_ necessary, the above imperfection can be avoided by adding tsk->last_dirty_bdi and tsk->to_pause, and doing the following when switching to another bdi:

	to_pause += nr_dirtied / task_ratelimit
	if (to_pause > reasonable_large_pause_time) {
		sleep(to_pause)
		to_pause = 0
	}
	nr_dirtied = 0

Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 3/5] writeback: dirty rate control 2011-08-09 16:16 ` Peter Zijlstra @ 2011-08-10 14:00 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-10 14:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Wed, Aug 10, 2011 at 12:16:30AM +0800, Peter Zijlstra wrote: > On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote: > > > > So IIUC, bdi->dirty_ratelimit is the dynamically adjusted desired rate > > limit (based on position ratio, dirty_bw and write_bw). But this seems > > to be an overall bdi limit and does not seem to take into account the > > number of tasks doing IO to that bdi (as your comment suggests). > > So it probably will track write_bw as opposed to write_bw/N. What > > am I missing? In the normal situation (near the setpoints),

	task_ratelimit ~= bdi->dirty_ratelimit ~= write_bw / N

Yes, dirty_ratelimit is a per-bdi variable, because all tasks share roughly the same dirty ratelimit for the obvious reason of fairness. > I think the per-task thing comes from him using the pages_dirtied > argument to balance_dirty_pages() to compute the sleep time. Yeah. Ultimately it will allow different tasks to be throttled at different (user specified) rates. > Although I'm not quite sure how he keeps fairness in light of the > sleep time bounding to MAX_PAUSE. Firstly, MAX_PAUSE will only be applied when the dirty pages rush high (dirty exceeded). Secondly, the dirty exceeded state is global to all tasks, in which case each task will sleep for MAX_PAUSE equally. So the fairness is still maintained in dirty exceeded state. Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 3/5] writeback: dirty rate control 2011-08-10 14:00 ` Wu Fengguang @ 2011-08-10 17:10 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-10 17:10 UTC (permalink / raw) To: Wu Fengguang Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Wed, 2011-08-10 at 22:00 +0800, Wu Fengguang wrote: > > > Although I'm not quite sure how he keeps fairness in light of the > > sleep time bounding to MAX_PAUSE. > > Firstly, MAX_PAUSE will only be applied when the dirty pages rush > high (dirty exceeded). Secondly, the dirty exceeded state is global > to all tasks, in which case each task will sleep for MAX_PAUSE equally. > So the fairness is still maintained in dirty exceeded state. It's not immediately apparent how dirty_exceeded and MAX_PAUSE interact, but having everybody sleep MAX_PAUSE doesn't necessarily mean it's fair; it's only fair if they dirty at the same rate. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 3/5] writeback: dirty rate control 2011-08-10 17:10 ` Peter Zijlstra @ 2011-08-15 14:11 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-15 14:11 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Thu, Aug 11, 2011 at 01:10:26AM +0800, Peter Zijlstra wrote: > On Wed, 2011-08-10 at 22:00 +0800, Wu Fengguang wrote: > > > > > Although I'm not quite sure how he keeps fairness in light of the > > > sleep time bounding to MAX_PAUSE. > > > > Firstly, MAX_PAUSE will only be applied when the dirty pages rush > > high (dirty exceeded). Secondly, the dirty exceeded state is global > > to all tasks, in which case each task will sleep for MAX_PAUSE equally. > > So the fairness is still maintained in dirty exceeded state. > > It's not immediately apparent how dirty_exceeded and MAX_PAUSE interact, > but having everybody sleep MAX_PAUSE doesn't necessarily mean it's fair; > it's only fair if they dirty at the same rate. Yeah, I forgot to mention that, but when dirty_exceeded, the tasks will typically sleep for MAX_PAUSE on every 8 pages, resulting in the same dirty rate :) Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 3/5] writeback: dirty rate control 2011-08-06 8:44 ` Wu Fengguang (?) @ 2011-08-09 16:56 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-09 16:56 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote: > bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4; I can't actually find this low-pass filter in the code.. could be I'm blind from staring at it too long though.. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 3/5] writeback: dirty rate control 2011-08-09 16:56 ` Peter Zijlstra (?) (?) @ 2011-08-10 14:10 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-10 14:10 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML [-- Attachment #1: Type: text/plain, Size: 600 bytes --] On Wed, Aug 10, 2011 at 12:56:56AM +0800, Peter Zijlstra wrote: > On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote: > > bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4; > > I can't actually find this low-pass filter in the code.. could be I'm > blind from staring at it too long though.. Sorry, it's implemented in another patch (attached). I've also removed it from _this_ changelog. Here you can find all the other patches in addition to the core bits. http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=shortlog;h=refs/heads/dirty-throttling-v8%2B Thanks, Fengguang [-- Attachment #2: smooth-base-bw --] [-- Type: text/plain, Size: 2488 bytes --] Subject: writeback: make dirty_ratelimit stable/smooth Date: Thu Aug 04 22:05:05 CST 2011 Half the dirty_ratelimit update step size to avoid overshooting, and further slow down the updates when the tracking error is smaller than (base_rate / 8). It's desirable to have a _constant_ dirty_ratelimit given a stable workload. Because each jolt of dirty_ratelimit will directly show up in all the bdi tasks' dirty rate. The cost will be slightly increased dirty position error, which is pretty acceptable. 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- mm/page-writeback.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) --- linux-next.orig/mm/page-writeback.c 2011-08-10 21:35:11.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-08-10 21:35:31.000000000 +0800 @@ -741,6 +741,7 @@ static void bdi_update_dirty_ratelimit(s unsigned long dirty_rate; unsigned long pos_rate; unsigned long balanced_rate; + unsigned long delta; unsigned long long pos_ratio; /* @@ -755,7 +756,6 @@ static void bdi_update_dirty_ratelimit(s * pos_rate reflects each dd's dirty rate enforced for the past 200ms. */ pos_rate = base_rate * pos_ratio >> BANDWIDTH_CALC_SHIFT; - pos_rate++; /* this avoids bdi->dirty_ratelimit get stuck in 0 */ /* * balanced_rate = pos_rate * write_bw / dirty_rate @@ -777,14 +777,32 @@ static void bdi_update_dirty_ratelimit(s * makes it more stable, but also is essential for preventing it being * driven away by possible systematic errors in balanced_rate. */ + delta = 0; if (base_rate > pos_rate) { if (base_rate > balanced_rate) - base_rate = max(balanced_rate, pos_rate); + delta = base_rate - max(balanced_rate, pos_rate); } else { if (base_rate < balanced_rate) - base_rate = min(balanced_rate, pos_rate); + delta = min(balanced_rate, pos_rate) - base_rate; } + /* + * Don't pursue 100% rate matching. It's impossible since the balanced + * rate itself is constantly fluctuating. So decrease the track speed + * when it gets close to the target. Eliminates unnecessary jolting. + */ + delta >>= base_rate / (8 * delta + 1); + /* + * Limit the step size to avoid overshooting. It also implicitly + * prevents dirty_ratelimit from dropping to 0. + */ + delta >>= 2; + + if (base_rate < pos_rate) + base_rate += delta; + else + base_rate -= delta; + bdi->dirty_ratelimit = base_rate; trace_dirty_ratelimit(bdi, dirty_rate, pos_rate, balanced_rate); ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 3/5] writeback: dirty rate control 2011-08-06 8:44 ` Wu Fengguang (?) @ 2011-08-09 17:02 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-09 17:02 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote: > + pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; > + pos_bw++; /* this avoids bdi->dirty_ratelimit get stuck in 0 */ > + > + pos_ratio *= bdi->avg_write_bandwidth; > + do_div(pos_ratio, dirty_bw | 1); > + ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; when written out that results in: bw * pos_ratio * bdi->avg_write_bandwidth ref_bw = ----------------------------------------- dirty_bw which would suggest you write it like: ref_bw = div_u64((u64)pos_bw * bdi->avg_write_bandwidth, dirty_bw | 1); since pos_bw is already bw * pos_ratio per the above. Or am I missing something? ^ permalink raw reply [flat|nested] 301+ messages in thread
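[Editor's note] Peter's identity is easy to check numerically. A hedged user-space sketch follows; the kernel's `| 1` guard against division by zero is dropped here so the two forms compare exactly (in general they can differ by one unit of rounding, since the division happens at a different point):

```c
#include <assert.h>
#include <stdint.h>

#define BANDWIDTH_CALC_SHIFT 10

/* ref_bw as currently computed in the patch (shift order preserved) */
uint64_t ref_bw_long(uint64_t bw, uint64_t pos_ratio,
		     uint64_t avg_write_bw, uint64_t dirty_bw)
{
	uint64_t r = pos_ratio * avg_write_bw / dirty_bw;

	return bw * r >> BANDWIDTH_CALC_SHIFT;
}

/* Peter's simplification: reuse pos_bw = bw * pos_ratio >> SHIFT */
uint64_t ref_bw_short(uint64_t bw, uint64_t pos_ratio,
		      uint64_t avg_write_bw, uint64_t dirty_bw)
{
	uint64_t pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;

	return pos_bw * avg_write_bw / dirty_bw;
}
```

With bw = 4000, pos_ratio = 512 (i.e. 0.5 in 10-bit fixed point), avg_write_bw = 2000 and dirty_bw = 1000, both forms yield 4000, confirming the algebra.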
* Re: [PATCH 3/5] writeback: dirty rate control 2011-08-09 17:02 ` Peter Zijlstra @ 2011-08-10 14:15 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-10 14:15 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Wed, Aug 10, 2011 at 01:02:02AM +0800, Peter Zijlstra wrote: > On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote: > > > + pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; > > + pos_bw++; /* this avoids bdi->dirty_ratelimit get stuck in 0 */ > > + > > > + pos_ratio *= bdi->avg_write_bandwidth; > > + do_div(pos_ratio, dirty_bw | 1); > > + ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; > > when written out that results in: > > bw * pos_ratio * bdi->avg_write_bandwidth > ref_bw = ----------------------------------------- > dirty_bw > > which would suggest you write it like: > > ref_bw = div_u64((u64)pos_bw * bdi->avg_write_bandwidth, dirty_bw | 1); > > since pos_bw is already bw * pos_ratio per the above. Good point. Oops, I even wrote a comment for the overly complex calculation: * balanced_rate = pos_rate * write_bw / dirty_rate Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* [PATCH 4/5] writeback: per task dirty rate limit 2011-08-06 8:44 ` Wu Fengguang (?) @ 2011-08-06 8:44 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-06 8:44 UTC (permalink / raw) To: linux-fsdevel Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML [-- Attachment #1: per-task-ratelimit --] [-- Type: text/plain, Size: 7105 bytes --] Add two fields to task_struct. 1) account dirtied pages in the individual tasks, for accuracy 2) per-task balance_dirty_pages() call intervals, for flexibility The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will scale near-sqrt to the safety gap between dirty pages and threshold. XXX: The main problem of per-task nr_dirtied is, if 10k tasks start dirtying pages at exactly the same time, each task will be assigned a large initial nr_dirtied_pause, so that the dirty threshold will be exceeded long before each task reached its nr_dirtied_pause and hence call balance_dirty_pages(). 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- include/linux/sched.h | 7 ++ mm/memory_hotplug.c | 3 - mm/page-writeback.c | 106 +++++++++------------------------------- 3 files changed, 32 insertions(+), 84 deletions(-) --- linux-next.orig/include/linux/sched.h 2011-08-05 15:36:23.000000000 +0800 +++ linux-next/include/linux/sched.h 2011-08-05 15:39:52.000000000 +0800 @@ -1525,6 +1525,13 @@ struct task_struct { int make_it_fail; #endif struct prop_local_single dirties; + /* + * when (nr_dirtied >= nr_dirtied_pause), it's time to call + * balance_dirty_pages() for some dirty throttling pause + */ + int nr_dirtied; + int nr_dirtied_pause; + #ifdef CONFIG_LATENCYTOP int latency_record_count; struct latency_record latency_record[LT_SAVECOUNT]; --- linux-next.orig/mm/page-writeback.c 2011-08-05 15:39:48.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-08-05 15:39:52.000000000 +0800 @@ -48,26 +48,6 @@ #define BANDWIDTH_CALC_SHIFT 10 -/* - * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited - * will look to see if it needs to force writeback or throttling. - */ -static long ratelimit_pages = 32; - -/* - * When balance_dirty_pages decides that the caller needs to perform some - * non-background writeback, this is how many pages it will attempt to write. - * It should be somewhat larger than dirtied pages to ensure that reasonably - * large amounts of I/O are submitted. - */ -static inline long sync_writeback_pages(unsigned long dirtied) -{ - if (dirtied < ratelimit_pages) - dirtied = ratelimit_pages; - - return dirtied + dirtied / 2; -} - /* The following parameters are exported via /proc/sys/vm */ /* @@ -868,6 +848,23 @@ static void bdi_update_bandwidth(struct } /* + * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr() + * will look to see if it needs to start dirty throttling. + * + * If ratelimit_pages is too low then big NUMA machines will call the expensive + * global_page_state() too often. 
So scale it near-sqrt to the safety margin + * (the number of pages we may dirty without exceeding the dirty limits). + */ +static unsigned long ratelimit_pages(unsigned long dirty, + unsigned long thresh) +{ + if (thresh > dirty) + return 1UL << (ilog2(thresh - dirty) >> 1); + + return 1; +} + +/* * balance_dirty_pages() must be called by processes which are generating dirty * data. It looks at the number of dirty pages in the machine and will force * the caller to perform writeback if the system is over `vm_dirty_ratio'. @@ -1008,6 +1005,9 @@ static void balance_dirty_pages(struct a if (clear_dirty_exceeded && bdi->dirty_exceeded) bdi->dirty_exceeded = 0; + current->nr_dirtied = 0; + current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh); + if (writeback_in_progress(bdi)) return; @@ -1034,8 +1034,6 @@ void set_page_dirty_balance(struct page } } -static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0; - /** * balance_dirty_pages_ratelimited_nr - balance dirty memory state * @mapping: address_space which was dirtied @@ -1055,30 +1053,17 @@ void balance_dirty_pages_ratelimited_nr( { struct backing_dev_info *bdi = mapping->backing_dev_info; unsigned long ratelimit; - unsigned long *p; if (!bdi_cap_account_dirty(bdi)) return; - ratelimit = ratelimit_pages; - if (mapping->backing_dev_info->dirty_exceeded) + ratelimit = current->nr_dirtied_pause; + if (bdi->dirty_exceeded) ratelimit = 8; - /* - * Check the rate limiting. Also, we do not want to throttle real-time - * tasks in balance_dirty_pages(). Period. 
- */ - preempt_disable(); - p = &__get_cpu_var(bdp_ratelimits); - *p += nr_pages_dirtied; - if (unlikely(*p >= ratelimit)) { - ratelimit = sync_writeback_pages(*p); - *p = 0; - preempt_enable(); - balance_dirty_pages(mapping, ratelimit); - return; - } - preempt_enable(); + current->nr_dirtied += nr_pages_dirtied; + if (unlikely(current->nr_dirtied >= ratelimit)) + balance_dirty_pages(mapping, current->nr_dirtied); } EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); @@ -1166,44 +1151,6 @@ void laptop_sync_completion(void) #endif /* - * If ratelimit_pages is too high then we can get into dirty-data overload - * if a large number of processes all perform writes at the same time. - * If it is too low then SMP machines will call the (expensive) - * get_writeback_state too often. - * - * Here we set ratelimit_pages to a level which ensures that when all CPUs are - * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory - * thresholds before writeback cuts in. - * - * But the limit should not be set too high. Because it also controls the - * amount of memory which the balance_dirty_pages() caller has to write back. - * If this is too large then the caller will block on the IO queue all the - * time. So limit it to four megabytes - the balance_dirty_pages() caller - * will write six megabyte chunks, max. - */ - -void writeback_set_ratelimit(void) -{ - ratelimit_pages = vm_total_pages / (num_online_cpus() * 32); - if (ratelimit_pages < 16) - ratelimit_pages = 16; - if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024) - ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE; -} - -static int __cpuinit -ratelimit_handler(struct notifier_block *self, unsigned long u, void *v) -{ - writeback_set_ratelimit(); - return NOTIFY_DONE; -} - -static struct notifier_block __cpuinitdata ratelimit_nb = { - .notifier_call = ratelimit_handler, - .next = NULL, -}; - -/* * Called early on to tune the page writeback dirty limits. 
* * We used to scale dirty pages according to how total memory @@ -1225,9 +1172,6 @@ void __init page_writeback_init(void) { int shift; - writeback_set_ratelimit(); - register_cpu_notifier(&ratelimit_nb); - shift = calc_period_shift(); prop_descriptor_init(&vm_completions, shift); prop_descriptor_init(&vm_dirties, shift); --- linux-next.orig/mm/memory_hotplug.c 2011-08-05 15:36:23.000000000 +0800 +++ linux-next/mm/memory_hotplug.c 2011-08-05 15:39:52.000000000 +0800 @@ -527,8 +527,6 @@ int __ref online_pages(unsigned long pfn vm_total_pages = nr_free_pagecache_pages(); - writeback_set_ratelimit(); - if (onlined_pages) memory_notify(MEM_ONLINE, &arg); unlock_memory_hotplug(); @@ -970,7 +968,6 @@ repeat: } vm_total_pages = nr_free_pagecache_pages(); - writeback_set_ratelimit(); memory_notify(MEM_OFFLINE, &arg); unlock_memory_hotplug(); ^ permalink raw reply [flat|nested] 301+ messages in thread
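[Editor's note] The near-sqrt scaling of `ratelimit_pages()` in the patch above can be verified in user space. In this sketch, `ilog2_ul()` is a hypothetical stand-in for the kernel's `ilog2()`; the formula itself is copied from the patch:

```c
#include <assert.h>

/* User-space stand-in for the kernel's ilog2(): floor(log2(v)), v > 0 */
unsigned int ilog2_ul(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Same formula as the patch: 2^(floor(log2(gap)) / 2), roughly sqrt(gap) */
unsigned long ratelimit_pages(unsigned long dirty, unsigned long thresh)
{
	if (thresh > dirty)
		return 1UL << (ilog2_ul(thresh - dirty) >> 1);

	return 1;
}
```

A safety gap of 16384 pages gives a pause interval of 128 pages (the exact square root, since the gap is a power of four); a gap of 4 gives 2; at or over the threshold the task re-enters balance_dirty_pages() after every page. Between powers of four the result stays within a factor of two of the true square root, which is all the "near-sqrt" comment promises.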
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-06 8:44 ` Wu Fengguang @ 2011-08-06 14:35 ` Andrea Righi -1 siblings, 0 replies; 301+ messages in thread From: Andrea Righi @ 2011-08-06 14:35 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm, LKML On Sat, Aug 06, 2011 at 04:44:51PM +0800, Wu Fengguang wrote: > Add two fields to task_struct. > > 1) account dirtied pages in the individual tasks, for accuracy > 2) per-task balance_dirty_pages() call intervals, for flexibility > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will > scale near-sqrt to the safety gap between dirty pages and threshold. > > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start > dirtying pages at exactly the same time, each task will be assigned a > large initial nr_dirtied_pause, so that the dirty threshold will be > exceeded long before each task reached its nr_dirtied_pause and hence > call balance_dirty_pages(). > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> A minor nitpick below. 
Reviewed-by: Andrea Righi <andrea@betterlinux.com> > --- > include/linux/sched.h | 7 ++ > mm/memory_hotplug.c | 3 - > mm/page-writeback.c | 106 +++++++++------------------------------- > 3 files changed, 32 insertions(+), 84 deletions(-) > > --- linux-next.orig/include/linux/sched.h 2011-08-05 15:36:23.000000000 +0800 > +++ linux-next/include/linux/sched.h 2011-08-05 15:39:52.000000000 +0800 > @@ -1525,6 +1525,13 @@ struct task_struct { > int make_it_fail; > #endif > struct prop_local_single dirties; > + /* > + * when (nr_dirtied >= nr_dirtied_pause), it's time to call > + * balance_dirty_pages() for some dirty throttling pause > + */ > + int nr_dirtied; > + int nr_dirtied_pause; > + > #ifdef CONFIG_LATENCYTOP > int latency_record_count; > struct latency_record latency_record[LT_SAVECOUNT]; > --- linux-next.orig/mm/page-writeback.c 2011-08-05 15:39:48.000000000 +0800 > +++ linux-next/mm/page-writeback.c 2011-08-05 15:39:52.000000000 +0800 > @@ -48,26 +48,6 @@ > > #define BANDWIDTH_CALC_SHIFT 10 > > -/* > - * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited > - * will look to see if it needs to force writeback or throttling. > - */ > -static long ratelimit_pages = 32; > - > -/* > - * When balance_dirty_pages decides that the caller needs to perform some > - * non-background writeback, this is how many pages it will attempt to write. > - * It should be somewhat larger than dirtied pages to ensure that reasonably > - * large amounts of I/O are submitted. > - */ > -static inline long sync_writeback_pages(unsigned long dirtied) > -{ > - if (dirtied < ratelimit_pages) > - dirtied = ratelimit_pages; > - > - return dirtied + dirtied / 2; > -} > - > /* The following parameters are exported via /proc/sys/vm */ > > /* > @@ -868,6 +848,23 @@ static void bdi_update_bandwidth(struct > } > > /* > + * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr() > + * will look to see if it needs to start dirty throttling. 
> + * > + * If ratelimit_pages is too low then big NUMA machines will call the expensive > + * global_page_state() too often. So scale it near-sqrt to the safety margin > + * (the number of pages we may dirty without exceeding the dirty limits). > + */ > +static unsigned long ratelimit_pages(unsigned long dirty, > + unsigned long thresh) > +{ > + if (thresh > dirty) > + return 1UL << (ilog2(thresh - dirty) >> 1); > + > + return 1; > +} > + > +/* > * balance_dirty_pages() must be called by processes which are generating dirty > * data. It looks at the number of dirty pages in the machine and will force > * the caller to perform writeback if the system is over `vm_dirty_ratio'. I think we should also fix the comment of balance_dirty_pages(), now that it's IO-less for the caller. Maybe something like: /* * balance_dirty_pages() must be called by processes which are generating dirty * data. It looks at the number of dirty pages in the machine and will force * the caller to wait once crossing the dirty threshold. If we're over * `background_thresh' then the writeback threads are woken to perform some * writeout. 
*/ > @@ -1008,6 +1005,9 @@ static void balance_dirty_pages(struct a > if (clear_dirty_exceeded && bdi->dirty_exceeded) > bdi->dirty_exceeded = 0; > > + current->nr_dirtied = 0; > + current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh); > + > if (writeback_in_progress(bdi)) > return; > > @@ -1034,8 +1034,6 @@ void set_page_dirty_balance(struct page > } > } > > -static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0; > - > /** > * balance_dirty_pages_ratelimited_nr - balance dirty memory state > * @mapping: address_space which was dirtied > @@ -1055,30 +1053,17 @@ void balance_dirty_pages_ratelimited_nr( > { > struct backing_dev_info *bdi = mapping->backing_dev_info; > unsigned long ratelimit; > - unsigned long *p; > > if (!bdi_cap_account_dirty(bdi)) > return; > > - ratelimit = ratelimit_pages; > - if (mapping->backing_dev_info->dirty_exceeded) > + ratelimit = current->nr_dirtied_pause; > + if (bdi->dirty_exceeded) > ratelimit = 8; > > - /* > - * Check the rate limiting. Also, we do not want to throttle real-time > - * tasks in balance_dirty_pages(). Period. > - */ > - preempt_disable(); > - p = &__get_cpu_var(bdp_ratelimits); > - *p += nr_pages_dirtied; > - if (unlikely(*p >= ratelimit)) { > - ratelimit = sync_writeback_pages(*p); > - *p = 0; > - preempt_enable(); > - balance_dirty_pages(mapping, ratelimit); > - return; > - } > - preempt_enable(); > + current->nr_dirtied += nr_pages_dirtied; > + if (unlikely(current->nr_dirtied >= ratelimit)) > + balance_dirty_pages(mapping, current->nr_dirtied); > } > EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); > > @@ -1166,44 +1151,6 @@ void laptop_sync_completion(void) > #endif > > /* > - * If ratelimit_pages is too high then we can get into dirty-data overload > - * if a large number of processes all perform writes at the same time. > - * If it is too low then SMP machines will call the (expensive) > - * get_writeback_state too often. 
> - * > - * Here we set ratelimit_pages to a level which ensures that when all CPUs are > - * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory > - * thresholds before writeback cuts in. > - * > - * But the limit should not be set too high. Because it also controls the > - * amount of memory which the balance_dirty_pages() caller has to write back. > - * If this is too large then the caller will block on the IO queue all the > - * time. So limit it to four megabytes - the balance_dirty_pages() caller > - * will write six megabyte chunks, max. > - */ > - > -void writeback_set_ratelimit(void) > -{ > - ratelimit_pages = vm_total_pages / (num_online_cpus() * 32); > - if (ratelimit_pages < 16) > - ratelimit_pages = 16; > - if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024) > - ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE; > -} > - > -static int __cpuinit > -ratelimit_handler(struct notifier_block *self, unsigned long u, void *v) > -{ > - writeback_set_ratelimit(); > - return NOTIFY_DONE; > -} > - > -static struct notifier_block __cpuinitdata ratelimit_nb = { > - .notifier_call = ratelimit_handler, > - .next = NULL, > -}; > - > -/* > * Called early on to tune the page writeback dirty limits. 
> * > * We used to scale dirty pages according to how total memory > @@ -1225,9 +1172,6 @@ void __init page_writeback_init(void) > { > int shift; > > - writeback_set_ratelimit(); > - register_cpu_notifier(&ratelimit_nb); > - > shift = calc_period_shift(); > prop_descriptor_init(&vm_completions, shift); > prop_descriptor_init(&vm_dirties, shift); > --- linux-next.orig/mm/memory_hotplug.c 2011-08-05 15:36:23.000000000 +0800 > +++ linux-next/mm/memory_hotplug.c 2011-08-05 15:39:52.000000000 +0800 > @@ -527,8 +527,6 @@ int __ref online_pages(unsigned long pfn > > vm_total_pages = nr_free_pagecache_pages(); > > - writeback_set_ratelimit(); > - > if (onlined_pages) > memory_notify(MEM_ONLINE, &arg); > unlock_memory_hotplug(); > @@ -970,7 +968,6 @@ repeat: > } > > vm_total_pages = nr_free_pagecache_pages(); > - writeback_set_ratelimit(); > > memory_notify(MEM_OFFLINE, &arg); > unlock_memory_hotplug(); > Thanks, -Andrea ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-06 14:35 ` Andrea Righi @ 2011-08-07 6:19 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-07 6:19 UTC (permalink / raw) To: Andrea Righi Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm, LKML On Sat, Aug 06, 2011 at 10:35:31PM +0800, Andrea Righi wrote: > On Sat, Aug 06, 2011 at 04:44:51PM +0800, Wu Fengguang wrote: > > Add two fields to task_struct. > > > > 1) account dirtied pages in the individual tasks, for accuracy > > 2) per-task balance_dirty_pages() call intervals, for flexibility > > > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will > > scale near-sqrt to the safety gap between dirty pages and threshold. > > > > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start > > dirtying pages at exactly the same time, each task will be assigned a > > large initial nr_dirtied_pause, so that the dirty threshold will be > > exceeded long before each task reached its nr_dirtied_pause and hence > > call balance_dirty_pages(). > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > > A minor nitpick below. > > Reviewed-by: Andrea Righi <andrea@betterlinux.com> Thank you. > > +/* > > * balance_dirty_pages() must be called by processes which are generating dirty > > * data. It looks at the number of dirty pages in the machine and will force > > * the caller to perform writeback if the system is over `vm_dirty_ratio'. > > I think we should also fix the comment of balance_dirty_pages(), now > that it's IO-less for the caller. Maybe something like: > > /* > * balance_dirty_pages() must be called by processes which are generating dirty > * data. It looks at the number of dirty pages in the machine and will force > * the caller to wait once crossing the dirty threshold. 
If we're over > * `background_thresh' then the writeback threads are woken to perform some > * writeout. > */ Good catch! I'll add this change to the next patch: /* * balance_dirty_pages() must be called by processes which are generating dirty * data. It looks at the number of dirty pages in the machine and will force - * the caller to perform writeback if the system is over `vm_dirty_ratio'. + * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2. * If we're over `background_thresh' then the writeback threads are woken to * perform some writeout. */ Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-06 8:44 ` Wu Fengguang (?) @ 2011-08-08 13:47 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-08 13:47 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote: > Add two fields to task_struct. > > 1) account dirtied pages in the individual tasks, for accuracy > 2) per-task balance_dirty_pages() call intervals, for flexibility > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will > scale near-sqrt to the safety gap between dirty pages and threshold. > > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start > dirtying pages at exactly the same time, each task will be assigned a > large initial nr_dirtied_pause, so that the dirty threshold will be > exceeded long before each task reached its nr_dirtied_pause and hence > call balance_dirty_pages(). > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > include/linux/sched.h | 7 ++ > mm/memory_hotplug.c | 3 - > mm/page-writeback.c | 106 +++++++++------------------------------- > 3 files changed, 32 insertions(+), 84 deletions(-) No fork() hooks? This way tasks inherit their parent's dirty count on clone(). ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-08 13:47 ` Peter Zijlstra @ 2011-08-08 14:21 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-08 14:21 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Mon, Aug 08, 2011 at 09:47:14PM +0800, Peter Zijlstra wrote: > On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote: > > Add two fields to task_struct. > > > > 1) account dirtied pages in the individual tasks, for accuracy > > 2) per-task balance_dirty_pages() call intervals, for flexibility > > > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will > > scale near-sqrt to the safety gap between dirty pages and threshold. > > > > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start > > dirtying pages at exactly the same time, each task will be assigned a > > large initial nr_dirtied_pause, so that the dirty threshold will be > > exceeded long before each task reached its nr_dirtied_pause and hence > > call balance_dirty_pages(). > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > > --- > > include/linux/sched.h | 7 ++ > > mm/memory_hotplug.c | 3 - > > mm/page-writeback.c | 106 +++++++++------------------------------- > > 3 files changed, 32 insertions(+), 84 deletions(-) > > No fork() hooks? This way tasks inherit their parent's dirty count on > clone(). Ah good point. Here is the quick fix. Thanks, Fengguang --- --- linux-next.orig/kernel/fork.c 2011-08-08 22:11:59.000000000 +0800 +++ linux-next/kernel/fork.c 2011-08-08 22:18:05.000000000 +0800 @@ -1301,6 +1301,9 @@ static struct task_struct *copy_process( p->pdeath_signal = 0; p->exit_state = 0; + p->nr_dirtied = 0; + p->nr_dirtied_pause = 8; + /* * Ok, make it visible to the rest of the system. * We dont wake it up yet. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-08 14:21 ` Wu Fengguang @ 2011-08-08 23:32 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-08 23:32 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML > --- linux-next.orig/kernel/fork.c 2011-08-08 22:11:59.000000000 +0800 > +++ linux-next/kernel/fork.c 2011-08-08 22:18:05.000000000 +0800 > @@ -1301,6 +1301,9 @@ static struct task_struct *copy_process( > p->pdeath_signal = 0; > p->exit_state = 0; > > + p->nr_dirtied = 0; > + p->nr_dirtied_pause = 8; Hmm, it looks better to allow a new task to dirty 128KB without being throttled, if the system is not in dirty exceeded state. So changed the last line to this: + p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10); Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-08 13:47 ` Peter Zijlstra
@ 2011-08-08 14:23 ` Wu Fengguang
  -1 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-08 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
      linux-mm, LKML

On Mon, Aug 08, 2011 at 09:47:14PM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > Add two fields to task_struct.
> >
> > 1) account dirtied pages in the individual tasks, for accuracy
> > 2) per-task balance_dirty_pages() call intervals, for flexibility
> >
> > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> > scale near-sqrt to the safety gap between dirty pages and threshold.
> >
> > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start
> > dirtying pages at exactly the same time, each task will be assigned a
> > large initial nr_dirtied_pause, so that the dirty threshold will be
> > exceeded long before each task reached its nr_dirtied_pause and hence
> > call balance_dirty_pages().
> >
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/sched.h |    7 ++
> >  mm/memory_hotplug.c   |    3 -
> >  mm/page-writeback.c   |  106 +++++++++-------------------------------
> >  3 files changed, 32 insertions(+), 84 deletions(-)
>
> No fork() hooks? This way tasks inherit their parent's dirty count on
> clone().

btw, I do have another patch queued for improving the "leaked dirties
on exit" case :)

Thanks,
Fengguang
---
Subject: writeback: charge leaked page dirties to active tasks
Date: Tue Apr 05 13:21:19 CST 2011

It's a years-long problem that a large number of short-lived dirtiers
(eg. gcc instances in a fast kernel build) may starve long-running
dirtiers (eg. dd), as well as push the dirty pages to the global hard
limit.

The solution is to charge the pages dirtied by the exited gcc to the
other random gcc/dd instances. It's not perfect, however it should
behave well enough in practice.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    2 ++
 kernel/exit.c             |    2 ++
 mm/page-writeback.c       |   11 +++++++++++
 3 files changed, 15 insertions(+)

--- linux-next.orig/include/linux/writeback.h	2011-08-08 21:45:58.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-08 21:45:58.000000000 +0800
@@ -7,6 +7,8 @@
 #include <linux/sched.h>
 #include <linux/fs.h>

+DECLARE_PER_CPU(int, dirty_leaks);
+
 /*
  * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
  *
--- linux-next.orig/mm/page-writeback.c	2011-08-08 21:45:58.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-08 22:21:50.000000000 +0800
@@ -190,6 +190,7 @@ int dirty_ratio_handler(struct ctl_table
 	return ret;
 }

+DEFINE_PER_CPU(int, dirty_leaks) = 0;

 int dirty_bytes_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *lenp,
@@ -1150,6 +1151,7 @@ void balance_dirty_pages_ratelimited_nr(
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	int ratelimit;
+	int *p;

 	if (!bdi_cap_account_dirty(bdi))
 		return;

@@ -1158,6 +1160,15 @@ void balance_dirty_pages_ratelimited_nr(
 	if (bdi->dirty_exceeded)
 		ratelimit = 8;

+	preempt_disable();
+	p = &__get_cpu_var(dirty_leaks);
+	if (*p > 0 && current->nr_dirtied < ratelimit) {
+		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
+		*p -= nr_pages_dirtied;
+		current->nr_dirtied += nr_pages_dirtied;
+	}
+	preempt_enable();
+
 	if (unlikely(current->nr_dirtied >= ratelimit))
 		balance_dirty_pages(mapping, current->nr_dirtied);
 }
--- linux-next.orig/kernel/exit.c	2011-08-08 21:43:37.000000000 +0800
+++ linux-next/kernel/exit.c	2011-08-08 21:45:58.000000000 +0800
@@ -1039,6 +1039,8 @@ NORET_TYPE void do_exit(long code)
 	validate_creds_for_do_exit(tsk);

 	preempt_disable();
+	if (tsk->nr_dirtied)
+		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
 	exit_rcu();
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-08 14:23 ` Wu Fengguang
  (?)
@ 2011-08-08 14:26 ` Peter Zijlstra
  -1 siblings, 0 replies; 301+ messages in thread
From: Peter Zijlstra @ 2011-08-08 14:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
      linux-mm, LKML

On Mon, 2011-08-08 at 22:23 +0800, Wu Fengguang wrote:
> +	preempt_disable();
> +	p = &__get_cpu_var(dirty_leaks);

	p = &get_cpu_var(dirty_leaks);

> +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> +		*p -= nr_pages_dirtied;
> +		current->nr_dirtied += nr_pages_dirtied;
> +	}
> +	preempt_enable();

	put_cpu_var(dirty_leaks);

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-08 14:26 ` Peter Zijlstra
@ 2011-08-08 22:38 ` Wu Fengguang
  -1 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-08 22:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
      Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
      linux-mm, LKML

On Mon, Aug 08, 2011 at 10:26:52PM +0800, Peter Zijlstra wrote:
> On Mon, 2011-08-08 at 22:23 +0800, Wu Fengguang wrote:
> > +	preempt_disable();
> > +	p = &__get_cpu_var(dirty_leaks);
>
> 	p = &get_cpu_var(dirty_leaks);
>
> > +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> > +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> > +		*p -= nr_pages_dirtied;
> > +		current->nr_dirtied += nr_pages_dirtied;
> > +	}
> > +	preempt_enable();
>
> 	put_cpu_var(dirty_leaks);

Good to know these, thanks!

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-08 14:23 ` Wu Fengguang
@ 2011-08-13 16:28 ` Andrea Righi
  -1 siblings, 0 replies; 301+ messages in thread
From: Andrea Righi @ 2011-08-13 16:28 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
      Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
      Vivek Goyal, linux-mm, LKML

On Mon, Aug 08, 2011 at 10:23:18PM +0800, Wu Fengguang wrote:
> On Mon, Aug 08, 2011 at 09:47:14PM +0800, Peter Zijlstra wrote:
> > On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > > Add two fields to task_struct.
> > >
> > > 1) account dirtied pages in the individual tasks, for accuracy
> > > 2) per-task balance_dirty_pages() call intervals, for flexibility
> > >
> > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> > > scale near-sqrt to the safety gap between dirty pages and threshold.
> > >
> > > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start
> > > dirtying pages at exactly the same time, each task will be assigned a
> > > large initial nr_dirtied_pause, so that the dirty threshold will be
> > > exceeded long before each task reached its nr_dirtied_pause and hence
> > > call balance_dirty_pages().
> > >
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  include/linux/sched.h |    7 ++
> > >  mm/memory_hotplug.c   |    3 -
> > >  mm/page-writeback.c   |  106 +++++++++-------------------------------
> > >  3 files changed, 32 insertions(+), 84 deletions(-)
> >
> > No fork() hooks? This way tasks inherit their parent's dirty count on
> > clone().
>
> btw, I do have another patch queued for improving the "leaked dirties
> on exit" case :)
>
> Thanks,
> Fengguang
> ---
> Subject: writeback: charge leaked page dirties to active tasks
> Date: Tue Apr 05 13:21:19 CST 2011
>
> It's a years long problem that a large number of short-lived dirtiers
> (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
> (eg. dd) as well as pushing the dirty pages to the global hard limit.
>
> The solution is to charge the pages dirtied by the exited gcc to the
> other random gcc/dd instances. It sounds not perfect, however should
> behave good enough in practice.
>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/writeback.h |    2 ++
>  kernel/exit.c             |    2 ++
>  mm/page-writeback.c       |   11 +++++++++++
>  3 files changed, 15 insertions(+)
>
> --- linux-next.orig/include/linux/writeback.h	2011-08-08 21:45:58.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-08-08 21:45:58.000000000 +0800
> @@ -7,6 +7,8 @@
>  #include <linux/sched.h>
>  #include <linux/fs.h>
>
> +DECLARE_PER_CPU(int, dirty_leaks);
> +
>  /*
>   * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
>   *
> --- linux-next.orig/mm/page-writeback.c	2011-08-08 21:45:58.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-08 22:21:50.000000000 +0800
> @@ -190,6 +190,7 @@ int dirty_ratio_handler(struct ctl_table
>  	return ret;
>  }
>
> +DEFINE_PER_CPU(int, dirty_leaks) = 0;
>
>  int dirty_bytes_handler(struct ctl_table *table, int write,
>  	void __user *buffer, size_t *lenp,
> @@ -1150,6 +1151,7 @@ void balance_dirty_pages_ratelimited_nr(
>  {
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	int ratelimit;
> +	int *p;
>
>  	if (!bdi_cap_account_dirty(bdi))
>  		return;
>
> @@ -1158,6 +1160,15 @@ void balance_dirty_pages_ratelimited_nr(
>  	if (bdi->dirty_exceeded)
>  		ratelimit = 8;
>
> +	preempt_disable();
> +	p = &__get_cpu_var(dirty_leaks);
> +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> +		*p -= nr_pages_dirtied;
> +		current->nr_dirtied += nr_pages_dirtied;
> +	}
> +	preempt_enable();
> +

I think we are still leaking some dirty pages: when the condition is
false, nr_pages_dirtied is just ignored.

Why not do something like this?

	current->nr_dirtied += nr_pages_dirtied;
	if (current->nr_dirtied < ratelimit) {
		p = &get_cpu_var(dirty_leaks);
		if (*p > 0) {
			nr_pages_dirtied = min(*p, ratelimit -
						current->nr_dirtied);
			*p -= nr_pages_dirtied;
		} else
			nr_pages_dirtied = 0;
		put_cpu_var(dirty_leaks);

		current->nr_dirtied += nr_pages_dirtied;
	}

Thanks,
-Andrea

> 	if (unlikely(current->nr_dirtied >= ratelimit))
> 		balance_dirty_pages(mapping, current->nr_dirtied);
> }
> --- linux-next.orig/kernel/exit.c	2011-08-08 21:43:37.000000000 +0800
> +++ linux-next/kernel/exit.c	2011-08-08 21:45:58.000000000 +0800
> @@ -1039,6 +1039,8 @@ NORET_TYPE void do_exit(long code)
> 	validate_creds_for_do_exit(tsk);
>
> 	preempt_disable();
> +	if (tsk->nr_dirtied)
> +		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
> 	exit_rcu();
> 	/* causes final put_task_struct in finish_task_switch(). */
> 	tsk->state = TASK_DEAD;

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-13 16:28 ` Andrea Righi
  (?)
@ 2011-08-15 14:21 ` Wu Fengguang
  2011-08-15 14:26   ` Andrea Righi
  -1 siblings, 1 reply; 301+ messages in thread
From: Wu Fengguang @ 2011-08-15 14:21 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
      Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
      Vivek Goyal, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1759 bytes --]

Andrea,

> > @@ -1158,6 +1160,15 @@ void balance_dirty_pages_ratelimited_nr(
> >  	if (bdi->dirty_exceeded)
> >  		ratelimit = 8;
> >
> > +	preempt_disable();
> > +	p = &__get_cpu_var(dirty_leaks);
> > +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> > +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> > +		*p -= nr_pages_dirtied;
> > +		current->nr_dirtied += nr_pages_dirtied;
> > +	}
> > +	preempt_enable();
> > +
>
> I think we are still leaking some dirty pages, when the condition is
> false nr_pages_dirtied is just ignored.
>
> Why not doing something like this?
>
> 	current->nr_dirtied += nr_pages_dirtied;

You must mean the above line. Sorry, I failed to provide another patch
before this one (attached this time). With that preparation patch, it
effectively becomes equal to the logic below :)

> 	if (current->nr_dirtied < ratelimit) {
> 		p = &get_cpu_var(dirty_leaks);
> 		if (*p > 0) {
> 			nr_pages_dirtied = min(*p, ratelimit -
> 						current->nr_dirtied);
> 			*p -= nr_pages_dirtied;
> 		} else
> 			nr_pages_dirtied = 0;
> 		put_cpu_var(dirty_leaks);
>
> 		current->nr_dirtied += nr_pages_dirtied;
> 	}

Thanks,
Fengguang

> > 	if (unlikely(current->nr_dirtied >= ratelimit))
> > 		balance_dirty_pages(mapping, current->nr_dirtied);
> > }
> > --- linux-next.orig/kernel/exit.c	2011-08-08 21:43:37.000000000 +0800
> > +++ linux-next/kernel/exit.c	2011-08-08 21:45:58.000000000 +0800
> > @@ -1039,6 +1039,8 @@ NORET_TYPE void do_exit(long code)
> > 	validate_creds_for_do_exit(tsk);
> >
> > 	preempt_disable();
> > +	if (tsk->nr_dirtied)
> > +		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
> > 	exit_rcu();
> > 	/* causes final put_task_struct in finish_task_switch(). */
> > 	tsk->state = TASK_DEAD;

[-- Attachment #2: writeback-accurate-task-dirtied.patch --]
[-- Type: text/x-diff, Size: 1226 bytes --]

Subject: writeback: fix dirtied pages accounting on sub-page writes
Date: Thu Apr 14 07:52:37 CST 2011

When dd'ing in 512-byte chunks, generic_perform_write() calls
balance_dirty_pages_ratelimited() 8 times for the same page, but
obviously the page is only dirtied once.

Fix it by accounting nr_dirtied at page dirty time.

This will allow further simplification of the
balance_dirty_pages_ratelimited_nr() calls.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-15 22:12:14.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-15 22:12:27.000000000 +0800
@@ -1211,8 +1211,6 @@ void balance_dirty_pages_ratelimited_nr(
 	else
 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

-	current->nr_dirtied += nr_pages_dirtied;
-
 	preempt_disable();
 	/*
 	 * This prevents one CPU to accumulate too many dirtied pages without
@@ -1711,6 +1709,7 @@ void account_page_dirtied(struct page *p
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
+		current->nr_dirtied++;
 	}
 }
 EXPORT_SYMBOL(account_page_dirtied);

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-15 14:21 ` Wu Fengguang
@ 2011-08-15 14:26 ` Andrea Righi
  0 siblings, 0 replies; 301+ messages in thread
From: Andrea Righi @ 2011-08-15 14:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
      Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
      Vivek Goyal, linux-mm, LKML

On Mon, Aug 15, 2011 at 10:21:41PM +0800, Wu Fengguang wrote:
> Andrea,
>
> > > @@ -1158,6 +1160,15 @@ void balance_dirty_pages_ratelimited_nr(
> > >  	if (bdi->dirty_exceeded)
> > >  		ratelimit = 8;
> > >
> > > +	preempt_disable();
> > > +	p = &__get_cpu_var(dirty_leaks);
> > > +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> > > +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> > > +		*p -= nr_pages_dirtied;
> > > +		current->nr_dirtied += nr_pages_dirtied;
> > > +	}
> > > +	preempt_enable();
> > > +
> >
> > I think we are still leaking some dirty pages, when the condition is
> > false nr_pages_dirtied is just ignored.
> >
> > Why not doing something like this?
> >
> > 	current->nr_dirtied += nr_pages_dirtied;
>
> You must mean the above line. Sorry I failed to provide another patch
> before this one (attached this time). With that preparation patch, it
> effectively become equal to the logic below :)

OK. This is even better than my proposal, because it doesn't charge
pages that are dirtied multiple times. Sounds good.

Thanks,
-Andrea

> > 	if (current->nr_dirtied < ratelimit) {
> > 		p = &get_cpu_var(dirty_leaks);
> > 		if (*p > 0) {
> > 			nr_pages_dirtied = min(*p, ratelimit -
> > 						current->nr_dirtied);
> > 			*p -= nr_pages_dirtied;
> > 		} else
> > 			nr_pages_dirtied = 0;
> > 		put_cpu_var(dirty_leaks);
> >
> > 		current->nr_dirtied += nr_pages_dirtied;
> > 	}
>
> Thanks,
> Fengguang
>
> > > 	if (unlikely(current->nr_dirtied >= ratelimit))
> > > 		balance_dirty_pages(mapping, current->nr_dirtied);
> > > }
> > > --- linux-next.orig/kernel/exit.c	2011-08-08 21:43:37.000000000 +0800
> > > +++ linux-next/kernel/exit.c	2011-08-08 21:45:58.000000000 +0800
> > > @@ -1039,6 +1039,8 @@ NORET_TYPE void do_exit(long code)
> > > 	validate_creds_for_do_exit(tsk);
> > >
> > > 	preempt_disable();
> > > +	if (tsk->nr_dirtied)
> > > +		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
> > > 	exit_rcu();
> > > 	/* causes final put_task_struct in finish_task_switch(). */
> > > 	tsk->state = TASK_DEAD;
>
> Subject: writeback: fix dirtied pages accounting on sub-page writes
> Date: Thu Apr 14 07:52:37 CST 2011
>
> When dd in 512bytes, generic_perform_write() calls
> balance_dirty_pages_ratelimited() 8 times for the same page, but
> obviously the page is only dirtied once.
>
> Fix it by accounting nr_dirtied at page dirty time.
>
> This will allow further simplification of the
> balance_dirty_pages_ratelimited_nr() calls.
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> --- linux-next.orig/mm/page-writeback.c	2011-08-15 22:12:14.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-15 22:12:27.000000000 +0800
> @@ -1211,8 +1211,6 @@ void balance_dirty_pages_ratelimited_nr(
>  	else
>  		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
>
> -	current->nr_dirtied += nr_pages_dirtied;
> -
>  	preempt_disable();
>  	/*
>  	 * This prevents one CPU to accumulate too many dirtied pages without
> @@ -1711,6 +1709,7 @@ void account_page_dirtied(struct page *p
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
>  		task_dirty_inc(current);
>  		task_io_account_write(PAGE_CACHE_SIZE);
> +		current->nr_dirtied++;
>  	}
>  }
>  EXPORT_SYMBOL(account_page_dirtied);

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit @ 2011-08-15 14:26 ` Andrea Righi 0 siblings, 0 replies; 301+ messages in thread From: Andrea Righi @ 2011-08-15 14:26 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm, LKML On Mon, Aug 15, 2011 at 10:21:41PM +0800, Wu Fengguang wrote: > Andrea, > > > > @@ -1158,6 +1160,15 @@ void balance_dirty_pages_ratelimited_nr( > > > if (bdi->dirty_exceeded) > > > ratelimit = 8; > > > > > > + preempt_disable(); > > > + p = &__get_cpu_var(dirty_leaks); > > > + if (*p > 0 && current->nr_dirtied < ratelimit) { > > > + nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied); > > > + *p -= nr_pages_dirtied; > > > + current->nr_dirtied += nr_pages_dirtied; > > > + } > > > + preempt_enable(); > > > + > > > > I think we are still leaking some dirty pages, when the condition is > > false nr_pages_dirtied is just ignored. > > > > Why not doing something like this? > > > > current->nr_dirtied += nr_pages_dirtied; > > You must mean the above line. Sorry I failed to provide another patch > before this one (attached this time). With that preparation patch, it > effectively become equal to the logic below :) OK. This is even better than my proposal, because it doesn't charge pages that are dirtied multiple times. Sounds good. 
Thanks, -Andrea > > > if (current->nr_dirtied < ratelimit) { > > p = &get_cpu_var(dirty_leaks); > > if (*p > 0) { > > nr_pages_dirtied = min(*p, ratelimit - > > current->nr_dirtied); > > *p -= nr_pages_dirtied; > > } else > > nr_pages_dirtied = 0; > > put_cpu_var(dirty_leaks); > > > > current->nr_dirtied += nr_pages_dirtied; > > } > > Thanks, > Fengguang > > > > if (unlikely(current->nr_dirtied >= ratelimit)) > > > balance_dirty_pages(mapping, current->nr_dirtied); > > > } > > > --- linux-next.orig/kernel/exit.c 2011-08-08 21:43:37.000000000 +0800 > > > +++ linux-next/kernel/exit.c 2011-08-08 21:45:58.000000000 +0800 > > > @@ -1039,6 +1039,8 @@ NORET_TYPE void do_exit(long code) > > > validate_creds_for_do_exit(tsk); > > > > > > preempt_disable(); > > > + if (tsk->nr_dirtied) > > > + __this_cpu_add(dirty_leaks, tsk->nr_dirtied); > > > exit_rcu(); > > > /* causes final put_task_struct in finish_task_switch(). */ > > > tsk->state = TASK_DEAD; > Subject: writeback: fix dirtied pages accounting on sub-page writes > Date: Thu Apr 14 07:52:37 CST 2011 > > When dd in 512bytes, generic_perform_write() calls > balance_dirty_pages_ratelimited() 8 times for the same page, but > obviously the page is only dirtied once. > > Fix it by accounting nr_dirtied at page dirty time. > > This will allow further simplification of the > balance_dirty_pages_ratelimited_nr() calls. 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > mm/page-writeback.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > --- linux-next.orig/mm/page-writeback.c 2011-08-15 22:12:14.000000000 +0800 > +++ linux-next/mm/page-writeback.c 2011-08-15 22:12:27.000000000 +0800 > @@ -1211,8 +1211,6 @@ void balance_dirty_pages_ratelimited_nr( > else > ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10)); > > - current->nr_dirtied += nr_pages_dirtied; > - > preempt_disable(); > /* > * This prevents one CPU to accumulate too many dirtied pages without > @@ -1711,6 +1709,7 @@ void account_page_dirtied(struct page *p > __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED); > task_dirty_inc(current); > task_io_account_write(PAGE_CACHE_SIZE); > + current->nr_dirtied++; > } > } > EXPORT_SYMBOL(account_page_dirtied); ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-06 8:44 ` Wu Fengguang @ 2011-08-09 17:46 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-09 17:46 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Sat, Aug 06, 2011 at 04:44:51PM +0800, Wu Fengguang wrote: [..] > * balance_dirty_pages() must be called by processes which are generating dirty > * data. It looks at the number of dirty pages in the machine and will force > * the caller to perform writeback if the system is over `vm_dirty_ratio'. > @@ -1008,6 +1005,9 @@ static void balance_dirty_pages(struct a > if (clear_dirty_exceeded && bdi->dirty_exceeded) > bdi->dirty_exceeded = 0; > > + current->nr_dirtied = 0; > + current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh); > + > if (writeback_in_progress(bdi)) > return; > > @@ -1034,8 +1034,6 @@ void set_page_dirty_balance(struct page > } > } > > -static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0; > - > /** > * balance_dirty_pages_ratelimited_nr - balance dirty memory state > * @mapping: address_space which was dirtied > @@ -1055,30 +1053,17 @@ void balance_dirty_pages_ratelimited_nr( > { > struct backing_dev_info *bdi = mapping->backing_dev_info; > unsigned long ratelimit; > - unsigned long *p; > > if (!bdi_cap_account_dirty(bdi)) > return; > > - ratelimit = ratelimit_pages; > - if (mapping->backing_dev_info->dirty_exceeded) > + ratelimit = current->nr_dirtied_pause; > + if (bdi->dirty_exceeded) > ratelimit = 8; Should we make sure that ratelimit is more than 8? It could be that ratelimit is 1 and we set it higher (just reverse of what we wanted?) Thanks Vivek ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-09 17:46 ` Vivek Goyal @ 2011-08-10 3:29 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-10 3:29 UTC (permalink / raw) To: Vivek Goyal Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Wed, Aug 10, 2011 at 01:46:21AM +0800, Vivek Goyal wrote: > On Sat, Aug 06, 2011 at 04:44:51PM +0800, Wu Fengguang wrote: > > [..] > > * balance_dirty_pages() must be called by processes which are generating dirty > > * data. It looks at the number of dirty pages in the machine and will force > > * the caller to perform writeback if the system is over `vm_dirty_ratio'. > > @@ -1008,6 +1005,9 @@ static void balance_dirty_pages(struct a > > if (clear_dirty_exceeded && bdi->dirty_exceeded) > > bdi->dirty_exceeded = 0; > > > > + current->nr_dirtied = 0; > > + current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh); > > + > > if (writeback_in_progress(bdi)) > > return; > > > > @@ -1034,8 +1034,6 @@ void set_page_dirty_balance(struct page > > } > > } > > > > -static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0; > > - > > /** > > * balance_dirty_pages_ratelimited_nr - balance dirty memory state > > * @mapping: address_space which was dirtied > > @@ -1055,30 +1053,17 @@ void balance_dirty_pages_ratelimited_nr( > > { > > struct backing_dev_info *bdi = mapping->backing_dev_info; > > unsigned long ratelimit; > > - unsigned long *p; > > > > if (!bdi_cap_account_dirty(bdi)) > > return; > > > > - ratelimit = ratelimit_pages; > > - if (mapping->backing_dev_info->dirty_exceeded) > > + ratelimit = current->nr_dirtied_pause; > > + if (bdi->dirty_exceeded) > > ratelimit = 8; > > Should we make sure that ratelimit is more than 8? It could be that > ratelimit is 1 and we set it higher (just reverse of what we wanted?) Good catch! 
I actually just fixed it in that direction :) if (bdi->dirty_exceeded) - ratelimit = 8; + ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10)); Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-10 3:29 ` Wu Fengguang @ 2011-08-10 18:18 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-10 18:18 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Wed, Aug 10, 2011 at 11:29:54AM +0800, Wu Fengguang wrote: [..] > > > - ratelimit = ratelimit_pages; > > > - if (mapping->backing_dev_info->dirty_exceeded) > > > + ratelimit = current->nr_dirtied_pause; > > > + if (bdi->dirty_exceeded) > > > ratelimit = 8; > > > > Should we make sure that ratelimit is more than 8? It could be that > > ratelimit is 1 and we set it higher (just reverse of what we wanted?) > > Good catch! I actually just fixed it in that direction :) > > if (bdi->dirty_exceeded) > - ratelimit = 8; > + ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10)); With page size 64K, will the above lead to ratelimit 0? Is that what you want? I wouldn't think so. Thanks Vivek ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-10 18:18 ` Vivek Goyal @ 2011-08-11 0:55 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-11 0:55 UTC (permalink / raw) To: Vivek Goyal Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Thu, Aug 11, 2011 at 02:18:54AM +0800, Vivek Goyal wrote: > On Wed, Aug 10, 2011 at 11:29:54AM +0800, Wu Fengguang wrote: > > [..] > > > > - ratelimit = ratelimit_pages; > > > > - if (mapping->backing_dev_info->dirty_exceeded) > > > > + ratelimit = current->nr_dirtied_pause; > > > > + if (bdi->dirty_exceeded) > > > > ratelimit = 8; > > > > > > Should we make sure that ratelimit is more than 8? It could be that > > > ratelimit is 1 and we set it higher (just reverse of what we wanted?) > > > > Good catch! I actually just fixed it in that direction :) > > > > if (bdi->dirty_exceeded) > > - ratelimit = 8; > > + ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10)); > > With page size 64K, will above lead to retelimit 0? Is that what you want. > I wouldn't think so. Yeah, it looks a bit weird... however, ratelimit=0 would behave the same as ratelimit=1, because balance_dirty_pages_ratelimited_nr() is always called with (nr_pages_dirtied >= 1). Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-06 8:44 ` Wu Fengguang (?) @ 2011-08-09 18:35 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-09 18:35 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote: > > Add two fields to task_struct. > > 1) account dirtied pages in the individual tasks, for accuracy > 2) per-task balance_dirty_pages() call intervals, for flexibility > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will > scale near-sqrt to the safety gap between dirty pages and threshold. > > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start > dirtying pages at exactly the same time, each task will be assigned a > large initial nr_dirtied_pause, so that the dirty threshold will be > exceeded long before each task reached its nr_dirtied_pause and hence > call balance_dirty_pages(). Right, so why remove the per-cpu threshold? You can keep that as a bound on the number of outstanding dirty pages. Losing that bound is actually a bad thing (TM), since you could have configured a tight dirty limit and lock up your machine this way. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-09 18:35 ` Peter Zijlstra @ 2011-08-10 3:40 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-10 3:40 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Wed, Aug 10, 2011 at 02:35:06AM +0800, Peter Zijlstra wrote: > On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote: > > > > Add two fields to task_struct. > > > > 1) account dirtied pages in the individual tasks, for accuracy > > 2) per-task balance_dirty_pages() call intervals, for flexibility > > > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will > > scale near-sqrt to the safety gap between dirty pages and threshold. > > > > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start > > dirtying pages at exactly the same time, each task will be assigned a > > large initial nr_dirtied_pause, so that the dirty threshold will be > > exceeded long before each task reached its nr_dirtied_pause and hence > > call balance_dirty_pages(). > > Right, so why remove the per-cpu threshold? you can keep that as a bound > on the number of out-standing dirty pages. Right, I also have the vague feeling that the per-cpu threshold can somehow backup the per-task threshold in case there are too many tasks. > Loosing that bound is actually a bad thing (TM), since you could have > configured a tight dirty limit and lock up your machine this way. It seems good enough to only remove the 4MB upper limit for ratelimit_pages, so that the per-cpu limit won't kick in too frequently in typical machines. * Here we set ratelimit_pages to a level which ensures that when all CPUs are * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory * thresholds before writeback cuts in. - * - * But the limit should not be set too high. 
Because it also controls the - * amount of memory which the balance_dirty_pages() caller has to write back. - * If this is too large then the caller will block on the IO queue all the - * time. So limit it to four megabytes - the balance_dirty_pages() caller - * will write six megabyte chunks, max. - */ - void writeback_set_ratelimit(void) { ratelimit_pages = vm_total_pages / (num_online_cpus() * 32); if (ratelimit_pages < 16) ratelimit_pages = 16; - if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024) - ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE; } Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-10 3:40 ` Wu Fengguang (?) @ 2011-08-10 10:25 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-10 10:25 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Wed, 2011-08-10 at 11:40 +0800, Wu Fengguang wrote: > On Wed, Aug 10, 2011 at 02:35:06AM +0800, Peter Zijlstra wrote: > > On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote: > > > > > > Add two fields to task_struct. > > > > > > 1) account dirtied pages in the individual tasks, for accuracy > > > 2) per-task balance_dirty_pages() call intervals, for flexibility > > > > > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will > > > scale near-sqrt to the safety gap between dirty pages and threshold. > > > > > > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start > > > dirtying pages at exactly the same time, each task will be assigned a > > > large initial nr_dirtied_pause, so that the dirty threshold will be > > > exceeded long before each task reached its nr_dirtied_pause and hence > > > call balance_dirty_pages(). > > > > Right, so why remove the per-cpu threshold? you can keep that as a bound > > on the number of out-standing dirty pages. > > Right, I also have the vague feeling that the per-cpu threshold can > somehow backup the per-task threshold in case there are too many tasks. > > > Loosing that bound is actually a bad thing (TM), since you could have > > configured a tight dirty limit and lock up your machine this way. > > It seems good enough to only remove the 4MB upper limit for > ratelimit_pages, so that the per-cpu limit won't kick in too > frequently in typical machines. 
> > * Here we set ratelimit_pages to a level which ensures that when all CPUs are > * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory > * thresholds before writeback cuts in. > - * > - * But the limit should not be set too high. Because it also controls the > - * amount of memory which the balance_dirty_pages() caller has to write back. > - * If this is too large then the caller will block on the IO queue all the > - * time. So limit it to four megabytes - the balance_dirty_pages() caller > - * will write six megabyte chunks, max. > - */ > - > void writeback_set_ratelimit(void) > { > ratelimit_pages = vm_total_pages / (num_online_cpus() * 32); > if (ratelimit_pages < 16) > ratelimit_pages = 16; > - if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024) > - ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE; > } Uhm, so what's your bound then? 1/32 of the per-cpu memory seems rather a lot. ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 4/5] writeback: per task dirty rate limit 2011-08-10 10:25 ` Peter Zijlstra @ 2011-08-10 11:13 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-10 11:13 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML On Wed, Aug 10, 2011 at 06:25:48PM +0800, Peter Zijlstra wrote: > On Wed, 2011-08-10 at 11:40 +0800, Wu Fengguang wrote: > > On Wed, Aug 10, 2011 at 02:35:06AM +0800, Peter Zijlstra wrote: > > > On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote: > > > > > > > > Add two fields to task_struct. > > > > > > > > 1) account dirtied pages in the individual tasks, for accuracy > > > > 2) per-task balance_dirty_pages() call intervals, for flexibility > > > > > > > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will > > > > scale near-sqrt to the safety gap between dirty pages and threshold. > > > > > > > > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start > > > > dirtying pages at exactly the same time, each task will be assigned a > > > > large initial nr_dirtied_pause, so that the dirty threshold will be > > > > exceeded long before each task reached its nr_dirtied_pause and hence > > > > call balance_dirty_pages(). > > > > > > Right, so why remove the per-cpu threshold? you can keep that as a bound > > > on the number of out-standing dirty pages. > > > > Right, I also have the vague feeling that the per-cpu threshold can > > somehow backup the per-task threshold in case there are too many tasks. > > > > > Loosing that bound is actually a bad thing (TM), since you could have > > > configured a tight dirty limit and lock up your machine this way. > > > > It seems good enough to only remove the 4MB upper limit for > > ratelimit_pages, so that the per-cpu limit won't kick in too > > frequently in typical machines. 
> > > > * Here we set ratelimit_pages to a level which ensures that when all CPUs are > > * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory > > * thresholds before writeback cuts in. > > - * > > - * But the limit should not be set too high. Because it also controls the > > - * amount of memory which the balance_dirty_pages() caller has to write back. > > - * If this is too large then the caller will block on the IO queue all the > > - * time. So limit it to four megabytes - the balance_dirty_pages() caller > > - * will write six megabyte chunks, max. > > - */ > > - > > void writeback_set_ratelimit(void) > > { > > ratelimit_pages = vm_total_pages / (num_online_cpus() * 32); > > if (ratelimit_pages < 16) > > ratelimit_pages = 16; > > - if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024) > > - ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE; > > } > > Uhm, so what's your bound then? 1/32 of the per-cpu memory seems rather > a lot. Ah yes, vm_total_pages is no longer suitable here; we may use ratelimit_pages = dirty_threshold / (num_online_cpus() * 32); We just need to ensure the dirty_threshold won't be exceeded too much in the rare case that tsk->nr_dirtied_pause cannot keep dirty pages under control when there are >10k dirtier tasks. Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-06 8:44 ` Wu Fengguang (?) @ 2011-08-06 8:44 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-06 8:44 UTC (permalink / raw) To: linux-fsdevel Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML [-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --] [-- Type: text/plain, Size: 14513 bytes --] As proposed by Chris, Dave and Jan, don't start foreground writeback IO inside balance_dirty_pages(). Instead, simply let it idle sleep for some time to throttle the dirtying task. Meanwhile, kick off the per-bdi flusher thread to do the background writeback IO. RATIONALE ========= - disk seeks on concurrent writeback of multiple inodes (Dave Chinner) If every thread doing writes and being throttled starts foreground writeback, it leads to N IO submitters from at least N different inodes at the same time, and we end up with N different sets of IO being issued with potentially zero locality to each other, resulting in much lower elevator sort/merge efficiency, and hence we seek the disk all over the place to service the different sets of IO. OTOH, if there is only one submission thread, it doesn't jump between inodes in the same way when congestion clears - it keeps writing to the same inode, resulting in large related chunks of sequential IO being issued to the disk. This is more efficient than the above foreground writeback because the elevator works better and the disk seeks less. - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner) With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention". 
* "CPU usage has dropped by ~55%", "it certainly appears that most of the CPU time saving comes from the removal of contention on the inode_wb_list_lock" (IMHO at least 10% comes from the reduction of cacheline bouncing, because the new code calls into balance_dirty_pages(), and hence accesses the global page states, much less frequently) * the user space "App overhead" is reduced by 20%, by avoiding the cacheline pollution by the complex writeback code path * "for a ~5% throughput reduction", "the number of write IOs have dropped by ~25%", and the elapsed time reduced from 41:42.17 to 40:53.23. * On a simple test of 100 dd's, it reduces the CPU %system time from 30% to 3%, and improves IO throughput from 38MB/s to 42MB/s. - IO size too small for fast arrays and too large for slow USB sticks The write_chunk used by the current balance_dirty_pages() cannot be directly set to some large value (eg. 128MB) for better IO efficiency, because that could lead to user-perceivable stalls of more than 1 second. Even the current 4MB write size may be too large for slow USB sticks. The fact that balance_dirty_pages() starts IO on its own couples the IO size to the wait time, which makes it hard to choose a suitable IO size while keeping the wait time under control. Now it's possible to increase the writeback chunk size in proportion to the disk bandwidth. In a simple test of 50 dd's on XFS, 1 HDD, 3GB RAM, the larger writeback size dramatically reduces the seek count to 1/10 (far beyond my expectation) and improves the write throughput by 24%. - long block time in balance_dirty_pages() hurts desktop responsiveness Many of us have had the experience: it often takes a couple of seconds or even longer to stop a heavy writing dd/cp/tar command with Ctrl-C or "kill -9". 
- IO pipeline broken by bumpy write() progress There is a broad class of "loop {read(buf); write(buf);}" applications whose read() pipeline will be under-utilized or even come to a stop if the write()s have long latencies _or_ don't progress at a constant rate. The current threshold based throttling inherently transfers the large low level IO completion fluctuations to bumpy application write()s, and further deteriorates with increasing numbers of dirtiers and/or bdi's. For example, when doing 50 dd's + 1 remote rsync to an XFS partition, the rsync progress is very bumpy in the legacy kernel, and throughput is improved by 67% by this patchset. (Plus the larger write chunk size, it becomes a 93% speedup.) The new rate based throttling can support 1000+ dd's with excellent smoothness, low latency and low overheads. For the above reasons, it's much better to do IO-less and low latency pauses in balance_dirty_pages(). Jan Kara, Dave Chinner and I explored a scheme to let balance_dirty_pages() wait for enough writeback IO completions to safeguard the dirty limit. However, it was found to have two problems: - in large NUMA systems, the per-cpu counters may have big accounting errors, leading to big throttle wait times and jitters. - NFS may kill a large number of unstable pages with one single COMMIT. Because the NFS server serves COMMIT with expensive fsync() IOs, it is desirable to delay and reduce the number of COMMITs. So it's not likely that such bursty IO completions, and the resulting large (and tiny) stall times in IO completion based throttling, can be optimized away. So here is a pause time oriented approach, which tries to control the pause time in each balance_dirty_pages() invocation, by controlling the number of pages dirtied before calling balance_dirty_pages(), for smooth and efficient dirty throttling: - avoid useless (eg. 
zero pause time) balance_dirty_pages() calls - avoid too small pause times (less than 4ms, which burns CPU power) - avoid too large pause times (more than 200ms, which hurts responsiveness) - avoid big fluctuations of pause times It can control pause times at will. The default policy will be to do ~10ms pauses in the 1-dd case, and increase to ~100ms in the 1000-dd case. BEHAVIOR CHANGE =============== (1) dirty threshold Users will notice that applications get throttled once they cross the global (background + dirty)/2=15% threshold, and are then balanced around 17.5%. Before this patch, the behavior was to just throttle at 20% dirtyable memory in the 1-dd case. Since the task will be soft throttled earlier than before, it may be perceived by end users as a performance "slow down" if their application happens to dirty more than 15% dirtyable memory. (2) smoothness/responsiveness Users will notice a more responsive system during heavy writeback. "killall dd" will take effect instantly. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- include/trace/events/writeback.h | 24 ---- mm/page-writeback.c | 142 +++++++---------------------- 2 files changed, 37 insertions(+), 129 deletions(-) --- linux-next.orig/mm/page-writeback.c 2011-08-06 11:17:26.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-08-06 16:16:30.000000000 +0800 @@ -242,50 +242,6 @@ static void bdi_writeout_fraction(struct numerator, denominator); } -static inline void task_dirties_fraction(struct task_struct *tsk, - long *numerator, long *denominator) -{ - prop_fraction_single(&vm_dirties, &tsk->dirties, - numerator, denominator); -} - -/* - * task_dirty_limit - scale down dirty throttling threshold for one task - * - * task specific dirty limit: - * - * dirty -= (dirty/8) * p_{t} - * - * To protect light/slow dirtying tasks from heavier/fast ones, we start - * throttling individual tasks before reaching the bdi dirty limit. - * Relatively low thresholds will be allocated to heavy dirtiers. 
So when - * dirty pages grow large, heavy dirtiers will be throttled first, which will - * effectively curb the growth of dirty pages. Light dirtiers with high enough - * dirty threshold may never get throttled. - */ -#define TASK_LIMIT_FRACTION 8 -static unsigned long task_dirty_limit(struct task_struct *tsk, - unsigned long bdi_dirty) -{ - long numerator, denominator; - unsigned long dirty = bdi_dirty; - u64 inv = dirty / TASK_LIMIT_FRACTION; - - task_dirties_fraction(tsk, &numerator, &denominator); - inv *= numerator; - do_div(inv, denominator); - - dirty -= inv; - - return max(dirty, bdi_dirty/2); -} - -/* Minimum limit for any task */ -static unsigned long task_min_dirty_limit(unsigned long bdi_dirty) -{ - return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION; -} - /* * */ @@ -855,24 +811,28 @@ static unsigned long ratelimit_pages(uns * perform some writeout. */ static void balance_dirty_pages(struct address_space *mapping, - unsigned long write_chunk) + unsigned long pages_dirtied) { - unsigned long nr_reclaimable, bdi_nr_reclaimable; + unsigned long nr_reclaimable; unsigned long nr_dirty; /* = file_dirty + writeback + unstable_nfs */ unsigned long bdi_dirty; unsigned long background_thresh; unsigned long dirty_thresh; unsigned long bdi_thresh; - unsigned long task_bdi_thresh; - unsigned long min_task_bdi_thresh; - unsigned long pages_written = 0; - unsigned long pause = 1; + unsigned long pause = 0; bool dirty_exceeded = false; - bool clear_dirty_exceeded = true; + unsigned long bw; + unsigned long base_bw; struct backing_dev_info *bdi = mapping->backing_dev_info; unsigned long start_time = jiffies; for (;;) { + /* + * Unstable writes are a feature of certain networked + * filesystems (i.e. NFS) in which data may have been + * written to the server's write cache, but has not yet + * been flushed to permanent storage. 
+ */ nr_reclaimable = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS); nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); @@ -888,8 +848,6 @@ static void balance_dirty_pages(struct a break; bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); - min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh); - task_bdi_thresh = task_dirty_limit(current, bdi_thresh); /* * In order to avoid the stacked BDI deadlock we need @@ -901,56 +859,38 @@ static void balance_dirty_pages(struct a * actually dirty; with m+n sitting in the percpu * deltas. */ - if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) { - bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE); - bdi_dirty = bdi_nr_reclaimable + + if (bdi_thresh < 2 * bdi_stat_error(bdi)) + bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) + bdi_stat_sum(bdi, BDI_WRITEBACK); - } else { - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); - bdi_dirty = bdi_nr_reclaimable + + else + bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) + bdi_stat(bdi, BDI_WRITEBACK); - } - /* - * The bdi thresh is somehow "soft" limit derived from the - * global "hard" limit. The former helps to prevent heavy IO - * bdi or process from holding back light ones; The latter is - * the last resort safeguard. - */ - dirty_exceeded = (bdi_dirty > task_bdi_thresh) || + dirty_exceeded = (bdi_dirty > bdi_thresh) || (nr_dirty > dirty_thresh); - clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) && - (nr_dirty <= dirty_thresh); - - if (!dirty_exceeded) - break; - - if (!bdi->dirty_exceeded) + if (dirty_exceeded && !bdi->dirty_exceeded) bdi->dirty_exceeded = 1; bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty, bdi_thresh, bdi_dirty, start_time); - /* Note: nr_reclaimable denotes nr_dirty + nr_unstable. - * Unstable writes are a feature of certain networked - * filesystems (i.e. NFS) in which data may have been - * written to the server's write cache, but has not yet - * been flushed to permanent storage. 
- * Only move pages to writeback if this bdi is over its - * threshold otherwise wait until the disk writes catch - * up. - */ - trace_balance_dirty_start(bdi); - if (bdi_nr_reclaimable > task_bdi_thresh) { - pages_written += writeback_inodes_wb(&bdi->wb, - write_chunk); - trace_balance_dirty_written(bdi, pages_written); - if (pages_written >= write_chunk) - break; /* We've done our duty */ + if (unlikely(!writeback_in_progress(bdi))) + bdi_start_background_writeback(bdi); + + base_bw = bdi->dirty_ratelimit; + bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty, + bdi_thresh, bdi_dirty); + if (unlikely(bw == 0)) { + pause = MAX_PAUSE; + goto pause; } + bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT; + pause = (HZ * pages_dirtied + bw / 2) / (bw | 1); + pause = min(pause, MAX_PAUSE); + +pause: __set_current_state(TASK_UNINTERRUPTIBLE); io_schedule_timeout(pause); - trace_balance_dirty_wait(bdi); dirty_thresh = hard_dirty_limit(dirty_thresh); /* @@ -960,8 +900,7 @@ static void balance_dirty_pages(struct a * (b) the pause time limit makes the dirtiers more responsive. */ if (nr_dirty < dirty_thresh + - dirty_thresh / DIRTY_MAXPAUSE_AREA && - time_after(jiffies, start_time + MAX_PAUSE)) + dirty_thresh / DIRTY_MAXPAUSE_AREA) break; /* * pass-good area. When some bdi gets blocked (eg. NFS server @@ -974,18 +913,9 @@ static void balance_dirty_pages(struct a dirty_thresh / DIRTY_PASSGOOD_AREA && bdi_dirty < bdi_thresh) break; - - /* - * Increase the delay for each loop, up to our previous - * default of taking a 100ms nap. 
- */ - pause <<= 1; - if (pause > HZ / 10) - pause = HZ / 10; } - /* Clear dirty_exceeded flag only when no task can exceed the limit */ - if (clear_dirty_exceeded && bdi->dirty_exceeded) + if (!dirty_exceeded && bdi->dirty_exceeded) bdi->dirty_exceeded = 0; current->nr_dirtied = 0; @@ -1002,8 +932,10 @@ static void balance_dirty_pages(struct a * In normal mode, we start background writeout at the lower * background_thresh, to keep the amount of dirty memory low. */ - if ((laptop_mode && pages_written) || - (!laptop_mode && (nr_reclaimable > background_thresh))) + if (laptop_mode) + return; + + if (nr_reclaimable > background_thresh) bdi_start_background_writeback(bdi); } --- linux-next.orig/include/trace/events/writeback.h 2011-08-06 11:08:34.000000000 +0800 +++ linux-next/include/trace/events/writeback.h 2011-08-06 11:17:29.000000000 +0800 @@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister); DEFINE_WRITEBACK_EVENT(writeback_thread_start); DEFINE_WRITEBACK_EVENT(writeback_thread_stop); -DEFINE_WRITEBACK_EVENT(balance_dirty_start); -DEFINE_WRITEBACK_EVENT(balance_dirty_wait); - -TRACE_EVENT(balance_dirty_written, - - TP_PROTO(struct backing_dev_info *bdi, int written), - - TP_ARGS(bdi, written), - - TP_STRUCT__entry( - __array(char, name, 32) - __field(int, written) - ), - - TP_fast_assign( - strncpy(__entry->name, dev_name(bdi->dev), 32); - __entry->written = written; - ), - - TP_printk("bdi %s written %d", - __entry->name, - __entry->written - ) -); DECLARE_EVENT_CLASS(wbc_class, TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), ^ permalink raw reply [flat|nested] 301+ messages in thread
* [PATCH 5/5] writeback: IO-less balance_dirty_pages() @ 2011-08-06 8:44 ` Wu Fengguang 0 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-06 8:44 UTC (permalink / raw) To: linux-fsdevel Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML [-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --] [-- Type: text/plain, Size: 14816 bytes --] As proposed by Chris, Dave and Jan, don't start foreground writeback IO inside balance_dirty_pages(). Instead, simply let it idle sleep for some time to throttle the dirtying task. In the mean while, kick off the per-bdi flusher thread to do background writeback IO. RATIONALS ========= - disk seeks on concurrent writeback of multiple inodes (Dave Chinner) If every thread doing writes and being throttled start foreground writeback, it leads to N IO submitters from at least N different inodes at the same time, end up with N different sets of IO being issued with potentially zero locality to each other, resulting in much lower elevator sort/merge efficiency and hence we seek the disk all over the place to service the different sets of IO. OTOH, if there is only one submission thread, it doesn't jump between inodes in the same way when congestion clears - it keeps writing to the same inode, resulting in large related chunks of sequential IOs being issued to the disk. This is more efficient than the above foreground writeback because the elevator works better and the disk seeks less. - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner) With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention". 
* "CPU usage has dropped by ~55%", "it certainly appears that most of the CPU time saving comes from the removal of contention on the inode_wb_list_lock" (IMHO at least 10% comes from the reduction of cacheline bouncing, because the new code is able to call much less frequently into balance_dirty_pages() and hence access the global page states) * the user space "App overhead" is reduced by 20%, by avoiding the cacheline pollution by the complex writeback code path * "for a ~5% throughput reduction", "the number of write IOs have dropped by ~25%", and the elapsed time reduced from 41:42.17 to 40:53.23. * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and improves IO throughput from 38MB/s to 42MB/s. - IO size too small for fast arrays and too large for slow USB sticks The write_chunk used by current balance_dirty_pages() cannot be directly set to some large value (eg. 128MB) for better IO efficiency. Because it could lead to more than 1 second user perceivable stalls. Even the current 4MB write size may be too large for slow USB sticks. The fact that balance_dirty_pages() starts IO on itself couples the IO size to wait time, which makes it hard to do suitable IO size while keeping the wait time under control. Now it's possible to increase writeback chunk size proportional to the disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram, the larger writeback size dramatically reduces the seek count to 1/10 (far beyond my expectation) and improves the write throughput by 24%. - long block time in balance_dirty_pages() hurts desktop responsiveness Many of us may have the experience: it often takes a couple of seconds or even long time to stop a heavy writing dd/cp/tar command with Ctrl-C or "kill -9". 
- IO pipeline broken by bumpy write() progress There are a broad class of "loop {read(buf); write(buf);}" applications whose read() pipeline will be under-utilized or even come to a stop if the write()s have long latencies _or_ don't progress in a constant rate. The current threshold based throttling inherently transfers the large low level IO completion fluctuations to bumpy application write()s, and further deteriorates with increasing number of dirtiers and/or bdi's. For example, when doing 50 dd's + 1 remote rsync to an XFS partition, the rsync progresses very bumpy in legacy kernel, and throughput is improved by 67% by this patchset. (plus the larger write chunk size, it will be 93% speedup). The new rate based throttling can support 1000+ dd's with excellent smoothness, low latency and low overheads. For the above reasons, it's much better to do IO-less and low latency pauses in balance_dirty_pages(). Jan Kara, Dave Chinner and me explored the scheme to let balance_dirty_pages() wait for enough writeback IO completions to safeguard the dirty limit. However it's found to have two problems: - in large NUMA systems, the per-cpu counters may have big accounting errors, leading to big throttle wait time and jitters. - NFS may kill large amount of unstable pages with one single COMMIT. Because NFS server serves COMMIT with expensive fsync() IOs, it is desirable to delay and reduce the number of COMMITs. So it's not likely to optimize away such kind of bursty IO completions, and the resulted large (and tiny) stall times in IO completion based throttling. So here is a pause time oriented approach, which tries to control the pause time in each balance_dirty_pages() invocations, by controlling the number of pages dirtied before calling balance_dirty_pages(), for smooth and efficient dirty throttling: - avoid useless (eg. 
zero pause time) balance_dirty_pages() calls - avoid too small pause time (less than 4ms, which burns CPU power) - avoid too large pause time (more than 200ms, which hurts responsiveness) - avoid big fluctuations of pause times It can control pause times at will. The default policy will be to do ~10ms pauses in 1-dd case, and increase to ~100ms in 1000-dd case. BEHAVIOR CHANGE =============== (1) dirty threshold Users will notice that the applications will get throttled once crossing the global (background + dirty)/2=15% threshold, and then balanced around 17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable memory in 1-dd case. Since the task will be soft throttled earlier than before, it may be perceived by end users as performance "slow down" if his application happens to dirty more than 15% dirtyable memory. (2) smoothness/responsiveness Users will notice a more responsive system during heavy writeback. "killall dd" will take effect instantly. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- include/trace/events/writeback.h | 24 ---- mm/page-writeback.c | 142 +++++++---------------------- 2 files changed, 37 insertions(+), 129 deletions(-) --- linux-next.orig/mm/page-writeback.c 2011-08-06 11:17:26.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-08-06 16:16:30.000000000 +0800 @@ -242,50 +242,6 @@ static void bdi_writeout_fraction(struct numerator, denominator); } -static inline void task_dirties_fraction(struct task_struct *tsk, - long *numerator, long *denominator) -{ - prop_fraction_single(&vm_dirties, &tsk->dirties, - numerator, denominator); -} - -/* - * task_dirty_limit - scale down dirty throttling threshold for one task - * - * task specific dirty limit: - * - * dirty -= (dirty/8) * p_{t} - * - * To protect light/slow dirtying tasks from heavier/fast ones, we start - * throttling individual tasks before reaching the bdi dirty limit. - * Relatively low thresholds will be allocated to heavy dirtiers. 
So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION	8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -855,24 +811,28 @@ static unsigned long ratelimit_pages(uns
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	unsigned long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long bw;
+	unsigned long base_bw;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;

 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -888,8 +848,6 @@ static void balance_dirty_pages(struct a
 			break;

 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);

 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -901,56 +859,38 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi))
+			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		else
+			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat(bdi, BDI_WRITEBACK);
-		}

-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				 (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-				       (nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;

 		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
 				     bdi_thresh, bdi_dirty, start_time);

-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		base_bw = bdi->dirty_ratelimit;
+		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
+					bdi_thresh, bdi_dirty);
+		if (unlikely(bw == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
+		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
+		pause = min(pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);

 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -960,8 +900,7 @@ static void balance_dirty_pages(struct a
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
 		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+			       dirty_thresh / DIRTY_MAXPAUSE_AREA)
 			break;
 		/*
 		 * pass-good area. When some bdi gets blocked (eg. NFS server
@@ -974,18 +913,9 @@ static void balance_dirty_pages(struct a
 			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
 		    bdi_dirty < bdi_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}

-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;

 	current->nr_dirtied = 0;
@@ -1002,8 +932,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }

--- linux-next.orig/include/trace/events/writeback.h	2011-08-06 11:08:34.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-06 11:17:29.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);

 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),

^ permalink raw reply	[flat|nested] 301+ messages in thread
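[Editorial note: the pause arithmetic the patch introduces above, pause = (HZ * pages_dirtied + bw / 2) / (bw | 1), is compact enough to miss. Below is a standalone userspace sketch of just that computation. HZ and MAX_PAUSE are stand-in values chosen for illustration (1000 ticks/s and the 200ms cap mentioned in the changelog), and calc_pause() is a hypothetical helper; the kernel does this inline with its own constants.]

```c
#include <assert.h>

#define HZ        1000		/* assumed tick rate for this sketch */
#define MAX_PAUSE (HZ / 5)	/* 200ms cap, per the changelog */

/*
 * Sketch of the patched pause computation.  "bw" is the task's throttle
 * bandwidth in pages per second.  "+ bw / 2" rounds the division to the
 * nearest tick; "| 1" keeps the divisor nonzero as a belt-and-braces
 * guard, while bw == 0 (fully throttled) is handled explicitly, like the
 * patch's "goto pause" path.
 */
static unsigned long calc_pause(unsigned long pages_dirtied, unsigned long bw)
{
	unsigned long pause;

	if (bw == 0)
		return MAX_PAUSE;	/* fully throttled: sleep the maximum */
	pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
	return pause < MAX_PAUSE ? pause : MAX_PAUSE;
}
```

With these stand-in numbers, a task that dirtied 32 pages against a 3200 pages/s throttle bandwidth sleeps about 10 ticks (roughly 10ms), matching the changelog's "~10ms pauses in 1-dd case"; a task throttled harder than it dirties simply gets longer pauses, capped at MAX_PAUSE.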
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-06 8:44 ` Wu Fengguang (?) @ 2011-08-06 14:48 ` Andrea Righi -1 siblings, 0 replies; 301+ messages in thread From: Andrea Righi @ 2011-08-06 14:48 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm, LKML On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote: > As proposed by Chris, Dave and Jan, don't start foreground writeback IO > inside balance_dirty_pages(). Instead, simply let it idle sleep for some > time to throttle the dirtying task. In the mean while, kick off the > per-bdi flusher thread to do background writeback IO. > > RATIONALS > ========= > > - disk seeks on concurrent writeback of multiple inodes (Dave Chinner) > > If every thread doing writes and being throttled start foreground > writeback, it leads to N IO submitters from at least N different > inodes at the same time, end up with N different sets of IO being > issued with potentially zero locality to each other, resulting in > much lower elevator sort/merge efficiency and hence we seek the disk > all over the place to service the different sets of IO. > OTOH, if there is only one submission thread, it doesn't jump between > inodes in the same way when congestion clears - it keeps writing to > the same inode, resulting in large related chunks of sequential IOs > being issued to the disk. This is more efficient than the above > foreground writeback because the elevator works better and the disk > seeks less. > > - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner) > > With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes > from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention". 
> > * "CPU usage has dropped by ~55%", "it certainly appears that most of > the CPU time saving comes from the removal of contention on the > inode_wb_list_lock" (IMHO at least 10% comes from the reduction of > cacheline bouncing, because the new code is able to call much less > frequently into balance_dirty_pages() and hence access the global > page states) > > * the user space "App overhead" is reduced by 20%, by avoiding the > cacheline pollution by the complex writeback code path > > * "for a ~5% throughput reduction", "the number of write IOs have > dropped by ~25%", and the elapsed time reduced from 41:42.17 to > 40:53.23. > > * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, > and improves IO throughput from 38MB/s to 42MB/s. > > - IO size too small for fast arrays and too large for slow USB sticks > > The write_chunk used by current balance_dirty_pages() cannot be > directly set to some large value (eg. 128MB) for better IO efficiency. > Because it could lead to more than 1 second user perceivable stalls. > Even the current 4MB write size may be too large for slow USB sticks. > The fact that balance_dirty_pages() starts IO on itself couples the > IO size to wait time, which makes it hard to do suitable IO size while > keeping the wait time under control. > > Now it's possible to increase writeback chunk size proportional to the > disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram, > the larger writeback size dramatically reduces the seek count to 1/10 > (far beyond my expectation) and improves the write throughput by 24%. > > - long block time in balance_dirty_pages() hurts desktop responsiveness > > Many of us may have the experience: it often takes a couple of seconds > or even long time to stop a heavy writing dd/cp/tar command with > Ctrl-C or "kill -9". 
> > - IO pipeline broken by bumpy write() progress > > There are a broad class of "loop {read(buf); write(buf);}" applications > whose read() pipeline will be under-utilized or even come to a stop if > the write()s have long latencies _or_ don't progress in a constant rate. > The current threshold based throttling inherently transfers the large > low level IO completion fluctuations to bumpy application write()s, > and further deteriorates with increasing number of dirtiers and/or bdi's. > > For example, when doing 50 dd's + 1 remote rsync to an XFS partition, > the rsync progresses very bumpy in legacy kernel, and throughput is > improved by 67% by this patchset. (plus the larger write chunk size, > it will be 93% speedup). > > The new rate based throttling can support 1000+ dd's with excellent > smoothness, low latency and low overheads. > > For the above reasons, it's much better to do IO-less and low latency > pauses in balance_dirty_pages(). > > Jan Kara, Dave Chinner and me explored the scheme to let > balance_dirty_pages() wait for enough writeback IO completions to > safeguard the dirty limit. However it's found to have two problems: > > - in large NUMA systems, the per-cpu counters may have big accounting > errors, leading to big throttle wait time and jitters. > > - NFS may kill large amount of unstable pages with one single COMMIT. > Because NFS server serves COMMIT with expensive fsync() IOs, it is > desirable to delay and reduce the number of COMMITs. So it's not > likely to optimize away such kind of bursty IO completions, and the > resulted large (and tiny) stall times in IO completion based throttling. > > So here is a pause time oriented approach, which tries to control the > pause time in each balance_dirty_pages() invocations, by controlling > the number of pages dirtied before calling balance_dirty_pages(), for > smooth and efficient dirty throttling: > > - avoid useless (eg. 
zero pause time) balance_dirty_pages() calls > - avoid too small pause time (less than 4ms, which burns CPU power) > - avoid too large pause time (more than 200ms, which hurts responsiveness) > - avoid big fluctuations of pause times > > It can control pause times at will. The default policy will be to do > ~10ms pauses in 1-dd case, and increase to ~100ms in 1000-dd case. > > BEHAVIOR CHANGE > =============== > > (1) dirty threshold > > Users will notice that the applications will get throttled once crossing > the global (background + dirty)/2=15% threshold, and then balanced around > 17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable > memory in 1-dd case. > > Since the task will be soft throttled earlier than before, it may be > perceived by end users as performance "slow down" if his application > happens to dirty more than 15% dirtyable memory. > > (2) smoothness/responsiveness > > Users will notice a more responsive system during heavy writeback. > "killall dd" will take effect instantly. > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- Another minor nit below. 
> include/trace/events/writeback.h |   24 ----
> mm/page-writeback.c              |  142 +++++++----------------------
> 2 files changed, 37 insertions(+), 129 deletions(-)
>
> --- linux-next.orig/mm/page-writeback.c	2011-08-06 11:17:26.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-06 16:16:30.000000000 +0800
> @@ -242,50 +242,6 @@ static void bdi_writeout_fraction(struct
>  					numerator, denominator);
>  }
>
> -static inline void task_dirties_fraction(struct task_struct *tsk,
> -		long *numerator, long *denominator)
> -{
> -	prop_fraction_single(&vm_dirties, &tsk->dirties,
> -				numerator, denominator);
> -}
> -
> -/*
> - * task_dirty_limit - scale down dirty throttling threshold for one task
> - *
> - * task specific dirty limit:
> - *
> - *   dirty -= (dirty/8) * p_{t}
> - *
> - * To protect light/slow dirtying tasks from heavier/fast ones, we start
> - * throttling individual tasks before reaching the bdi dirty limit.
> - * Relatively low thresholds will be allocated to heavy dirtiers. So when
> - * dirty pages grow large, heavy dirtiers will be throttled first, which will
> - * effectively curb the growth of dirty pages. Light dirtiers with high enough
> - * dirty threshold may never get throttled.
> - */
> -#define TASK_LIMIT_FRACTION	8
> -static unsigned long task_dirty_limit(struct task_struct *tsk,
> -				       unsigned long bdi_dirty)
> -{
> -	long numerator, denominator;
> -	unsigned long dirty = bdi_dirty;
> -	u64 inv = dirty / TASK_LIMIT_FRACTION;
> -
> -	task_dirties_fraction(tsk, &numerator, &denominator);
> -	inv *= numerator;
> -	do_div(inv, denominator);
> -
> -	dirty -= inv;
> -
> -	return max(dirty, bdi_dirty/2);
> -}
> -
> -/* Minimum limit for any task */
> -static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
> -{
> -	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
> -}
> -
>  /*
>  *
>  */
> @@ -855,24 +811,28 @@ static unsigned long ratelimit_pages(uns
>  * perform some writeout.
>  */
>  static void balance_dirty_pages(struct address_space *mapping,
> -				unsigned long write_chunk)
> +				unsigned long pages_dirtied)
>  {
> -	unsigned long nr_reclaimable, bdi_nr_reclaimable;
> +	unsigned long nr_reclaimable;
>  	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
>  	unsigned long bdi_dirty;
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> -	unsigned long task_bdi_thresh;
> -	unsigned long min_task_bdi_thresh;
> -	unsigned long pages_written = 0;
> -	unsigned long pause = 1;
> +	unsigned long pause = 0;
>  	bool dirty_exceeded = false;
> -	bool clear_dirty_exceeded = true;
> +	unsigned long bw;
> +	unsigned long base_bw;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	unsigned long start_time = jiffies;
>
>  	for (;;) {
> +		/*
> +		 * Unstable writes are a feature of certain networked
> +		 * filesystems (i.e. NFS) in which data may have been
> +		 * written to the server's write cache, but has not yet
> +		 * been flushed to permanent storage.
> +		 */
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
>  		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
> @@ -888,8 +848,6 @@ static void balance_dirty_pages(struct a
>  			break;
>
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> -		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
> -		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
>
>  		/*
>  		 * In order to avoid the stacked BDI deadlock we need
> @@ -901,56 +859,38 @@ static void balance_dirty_pages(struct a
>  		 * actually dirty; with m+n sitting in the percpu
>  		 * deltas.
>  		 */
> -		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
> -			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> -			bdi_dirty = bdi_nr_reclaimable +
> +		if (bdi_thresh < 2 * bdi_stat_error(bdi))
> +			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
>  				    bdi_stat_sum(bdi, BDI_WRITEBACK);
> -		} else {
> -			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> -			bdi_dirty = bdi_nr_reclaimable +
> +		else
> +			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
>  				    bdi_stat(bdi, BDI_WRITEBACK);
> -		}
>
> -		/*
> -		 * The bdi thresh is somehow "soft" limit derived from the
> -		 * global "hard" limit. The former helps to prevent heavy IO
> -		 * bdi or process from holding back light ones; The latter is
> -		 * the last resort safeguard.
> -		 */
> -		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
> +		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
>  				 (nr_dirty > dirty_thresh);
> -		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
> -				       (nr_dirty <= dirty_thresh);
> -
> -		if (!dirty_exceeded)
> -			break;
> -
> -		if (!bdi->dirty_exceeded)
> +		if (dirty_exceeded && !bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>
>  		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
>  				     bdi_thresh, bdi_dirty, start_time);
>
> -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 * Only move pages to writeback if this bdi is over its
> -		 * threshold otherwise wait until the disk writes catch
> -		 * up.
> -		 */
> -		trace_balance_dirty_start(bdi);
> -		if (bdi_nr_reclaimable > task_bdi_thresh) {
> -			pages_written += writeback_inodes_wb(&bdi->wb,
> -							     write_chunk);
> -			trace_balance_dirty_written(bdi, pages_written);
> -			if (pages_written >= write_chunk)
> -				break;		/* We've done our duty */
> +		if (unlikely(!writeback_in_progress(bdi)))
> +			bdi_start_background_writeback(bdi);
> +
> +		base_bw = bdi->dirty_ratelimit;
> +		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
> +					bdi_thresh, bdi_dirty);
> +		if (unlikely(bw == 0)) {
> +			pause = MAX_PAUSE;
> +			goto pause;
>  		}
> +		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
> +		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
> +		pause = min(pause, MAX_PAUSE);

Fix this build warning:

mm/page-writeback.c: In function ‘balance_dirty_pages’:
mm/page-writeback.c:889:11: warning: comparison of distinct pointer types lacks a cast

Signed-off-by: Andrea Righi <andrea@betterlinux.com>
---
 mm/page-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a36f83d..a998931 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -886,7 +886,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
 		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
-		pause = min(pause, MAX_PAUSE);
+		pause = min_t(unsigned long, pause, MAX_PAUSE);

 pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);

^ permalink raw reply related	[flat|nested] 301+ messages in thread
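[Editorial note: the warning Andrea fixes comes from the kernel's type-checked min(): MAX_PAUSE expands to an int-flavored expression while pause is unsigned long, and min() rejects mixed-type operands at compile time (the "comparison of distinct pointer types" message is a side effect of its typeof-based check). min_t() casts both operands to a named type first. The userspace sketch below mimics that idea; the min_t macro here is a simplification for illustration, not the kernel's implementation, and clamp_pause() is a hypothetical helper.]

```c
#include <assert.h>

/*
 * Simplified stand-in for the kernel's min_t(): cast both sides to the
 * requested type before comparing, so mixed-type arguments are accepted.
 * (Unlike the kernel macro, this one evaluates its arguments twice.)
 */
#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

static unsigned long clamp_pause(unsigned long pause)
{
	int max_pause = 1000 / 5;	/* int-typed limit, like HZ / 5 */

	/*
	 * A type-checked min(pause, max_pause) would mix unsigned long
	 * with int; min_t() makes the comparison type explicit.
	 */
	return min_t(unsigned long, pause, max_pause);
}
```

The same pattern applies anywhere a jiffies-derived constant (typically int) is compared against an unsigned long counter.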
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() @ 2011-08-06 14:48 ` Andrea Righi 0 siblings, 0 replies; 301+ messages in thread From: Andrea Righi @ 2011-08-06 14:48 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm, LKML On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote: > As proposed by Chris, Dave and Jan, don't start foreground writeback IO > inside balance_dirty_pages(). Instead, simply let it idle sleep for some > time to throttle the dirtying task. In the mean while, kick off the > per-bdi flusher thread to do background writeback IO. > > RATIONALS > ========= > > - disk seeks on concurrent writeback of multiple inodes (Dave Chinner) > > If every thread doing writes and being throttled start foreground > writeback, it leads to N IO submitters from at least N different > inodes at the same time, end up with N different sets of IO being > issued with potentially zero locality to each other, resulting in > much lower elevator sort/merge efficiency and hence we seek the disk > all over the place to service the different sets of IO. > OTOH, if there is only one submission thread, it doesn't jump between > inodes in the same way when congestion clears - it keeps writing to > the same inode, resulting in large related chunks of sequential IOs > being issued to the disk. This is more efficient than the above > foreground writeback because the elevator works better and the disk > seeks less. > > - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner) > > With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes > from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention". 
> > * "CPU usage has dropped by ~55%", "it certainly appears that most of > the CPU time saving comes from the removal of contention on the > inode_wb_list_lock" (IMHO at least 10% comes from the reduction of > cacheline bouncing, because the new code is able to call much less > frequently into balance_dirty_pages() and hence access the global > page states) > > * the user space "App overhead" is reduced by 20%, by avoiding the > cacheline pollution by the complex writeback code path > > * "for a ~5% throughput reduction", "the number of write IOs have > dropped by ~25%", and the elapsed time is reduced from 41:42.17 to > 40:53.23. > > * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, > and improves IO throughput from 38MB/s to 42MB/s. > > - IO size too small for fast arrays and too large for slow USB sticks > > The write_chunk used by the current balance_dirty_pages() cannot be > directly set to some large value (e.g. 128MB) for better IO efficiency, > because it could lead to user-perceivable stalls of more than 1 second. > Even the current 4MB write size may be too large for slow USB sticks. > The fact that balance_dirty_pages() starts IO by itself couples the > IO size to the wait time, which makes it hard to choose a suitable IO size > while keeping the wait time under control. > > Now it's possible to increase the writeback chunk size in proportion to the > disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB RAM, > the larger writeback size dramatically reduces the seek count to 1/10 > (far beyond my expectation) and improves the write throughput by 24%. > > - long block time in balance_dirty_pages() hurts desktop responsiveness > > Many of us may have had the experience: it often takes a couple of seconds > or even longer to stop a heavily writing dd/cp/tar command with > Ctrl-C or "kill -9".
> > - IO pipeline broken by bumpy write() progress > > There is a broad class of "loop {read(buf); write(buf);}" applications > whose read() pipeline will be under-utilized or even come to a stop if > the write()s have long latencies _or_ don't progress at a constant rate. > The current threshold based throttling inherently transfers the large > low level IO completion fluctuations to bumpy application write()s, > and further deteriorates with an increasing number of dirtiers and/or bdi's. > > For example, when doing 50 dd's + 1 remote rsync to an XFS partition, > the rsync progresses very bumpily in the legacy kernel, and throughput is > improved by 67% by this patchset (plus the larger write chunk size, > it becomes a 93% speedup). > > The new rate based throttling can support 1000+ dd's with excellent > smoothness, low latency and low overheads. > > For the above reasons, it's much better to do IO-less and low latency > pauses in balance_dirty_pages(). > > Jan Kara, Dave Chinner and I explored the scheme to let > balance_dirty_pages() wait for enough writeback IO completions to > safeguard the dirty limit. However, it was found to have two problems: > > - in large NUMA systems, the per-cpu counters may have big accounting > errors, leading to big throttle wait times and jitter. > > - NFS may kill a large amount of unstable pages with one single COMMIT. > Because the NFS server serves COMMIT with expensive fsync() IOs, it is > desirable to delay and reduce the number of COMMITs. So it's not > likely to optimize away such bursty IO completions, and the > resulting large (and tiny) stall times in IO completion based throttling. > > So here is a pause time oriented approach, which tries to control the > pause time in each balance_dirty_pages() invocation, by controlling > the number of pages dirtied before calling balance_dirty_pages(), for > smooth and efficient dirty throttling: > > - avoid useless (e.g. 
zero pause time) balance_dirty_pages() calls > - avoid too small pause time (less than 4ms, which burns CPU power) > - avoid too large pause time (more than 200ms, which hurts responsiveness) > - avoid big fluctuations of pause times > > It can control pause times at will. The default policy will be to do > ~10ms pauses in the 1-dd case, and increase to ~100ms in the 1000-dd case. > > BEHAVIOR CHANGE > =============== > > (1) dirty threshold > > Users will notice that applications get throttled once they cross > the global (background + dirty)/2=15% threshold, and are then balanced around > 17.5%. Before this patch, the behavior is to just throttle at 20% of dirtyable > memory in the 1-dd case. > > Since the task will be soft throttled earlier than before, it may be > perceived by end users as a performance "slow down" if their application > happens to dirty more than 15% of dirtyable memory. > > (2) smoothness/responsiveness > > Users will notice a more responsive system during heavy writeback. > "killall dd" will take effect instantly. > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- Another minor nit below. 
> include/trace/events/writeback.h | 24 ---- > mm/page-writeback.c | 142 +++++++---------------------- > 2 files changed, 37 insertions(+), 129 deletions(-) > > --- linux-next.orig/mm/page-writeback.c 2011-08-06 11:17:26.000000000 +0800 > +++ linux-next/mm/page-writeback.c 2011-08-06 16:16:30.000000000 +0800 > @@ -242,50 +242,6 @@ static void bdi_writeout_fraction(struct > numerator, denominator); > } > > -static inline void task_dirties_fraction(struct task_struct *tsk, > - long *numerator, long *denominator) > -{ > - prop_fraction_single(&vm_dirties, &tsk->dirties, > - numerator, denominator); > -} > - > -/* > - * task_dirty_limit - scale down dirty throttling threshold for one task > - * > - * task specific dirty limit: > - * > - * dirty -= (dirty/8) * p_{t} > - * > - * To protect light/slow dirtying tasks from heavier/fast ones, we start > - * throttling individual tasks before reaching the bdi dirty limit. > - * Relatively low thresholds will be allocated to heavy dirtiers. So when > - * dirty pages grow large, heavy dirtiers will be throttled first, which will > - * effectively curb the growth of dirty pages. Light dirtiers with high enough > - * dirty threshold may never get throttled. > - */ > -#define TASK_LIMIT_FRACTION 8 > -static unsigned long task_dirty_limit(struct task_struct *tsk, > - unsigned long bdi_dirty) > -{ > - long numerator, denominator; > - unsigned long dirty = bdi_dirty; > - u64 inv = dirty / TASK_LIMIT_FRACTION; > - > - task_dirties_fraction(tsk, &numerator, &denominator); > - inv *= numerator; > - do_div(inv, denominator); > - > - dirty -= inv; > - > - return max(dirty, bdi_dirty/2); > -} > - > -/* Minimum limit for any task */ > -static unsigned long task_min_dirty_limit(unsigned long bdi_dirty) > -{ > - return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION; > -} > - > /* > * > */ > @@ -855,24 +811,28 @@ static unsigned long ratelimit_pages(uns > * perform some writeout. 
> */ > static void balance_dirty_pages(struct address_space *mapping, > - unsigned long write_chunk) > + unsigned long pages_dirtied) > { > - unsigned long nr_reclaimable, bdi_nr_reclaimable; > + unsigned long nr_reclaimable; > unsigned long nr_dirty; /* = file_dirty + writeback + unstable_nfs */ > unsigned long bdi_dirty; > unsigned long background_thresh; > unsigned long dirty_thresh; > unsigned long bdi_thresh; > - unsigned long task_bdi_thresh; > - unsigned long min_task_bdi_thresh; > - unsigned long pages_written = 0; > - unsigned long pause = 1; > + unsigned long pause = 0; > bool dirty_exceeded = false; > - bool clear_dirty_exceeded = true; > + unsigned long bw; > + unsigned long base_bw; > struct backing_dev_info *bdi = mapping->backing_dev_info; > unsigned long start_time = jiffies; > > for (;;) { > + /* > + * Unstable writes are a feature of certain networked > + * filesystems (i.e. NFS) in which data may have been > + * written to the server's write cache, but has not yet > + * been flushed to permanent storage. > + */ > nr_reclaimable = global_page_state(NR_FILE_DIRTY) + > global_page_state(NR_UNSTABLE_NFS); > nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); > @@ -888,8 +848,6 @@ static void balance_dirty_pages(struct a > break; > > bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); > - min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh); > - task_bdi_thresh = task_dirty_limit(current, bdi_thresh); > > /* > * In order to avoid the stacked BDI deadlock we need > @@ -901,56 +859,38 @@ static void balance_dirty_pages(struct a > * actually dirty; with m+n sitting in the percpu > * deltas. 
> */ > - if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) { > - bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE); > - bdi_dirty = bdi_nr_reclaimable + > + if (bdi_thresh < 2 * bdi_stat_error(bdi)) > + bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) + > bdi_stat_sum(bdi, BDI_WRITEBACK); > - } else { > - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); > - bdi_dirty = bdi_nr_reclaimable + > + else > + bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) + > bdi_stat(bdi, BDI_WRITEBACK); > - } > > - /* > - * The bdi thresh is somehow "soft" limit derived from the > - * global "hard" limit. The former helps to prevent heavy IO > - * bdi or process from holding back light ones; The latter is > - * the last resort safeguard. > - */ > - dirty_exceeded = (bdi_dirty > task_bdi_thresh) || > + dirty_exceeded = (bdi_dirty > bdi_thresh) || > (nr_dirty > dirty_thresh); > - clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) && > - (nr_dirty <= dirty_thresh); > - > - if (!dirty_exceeded) > - break; > - > - if (!bdi->dirty_exceeded) > + if (dirty_exceeded && !bdi->dirty_exceeded) > bdi->dirty_exceeded = 1; > > bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty, > bdi_thresh, bdi_dirty, start_time); > > - /* Note: nr_reclaimable denotes nr_dirty + nr_unstable. > - * Unstable writes are a feature of certain networked > - * filesystems (i.e. NFS) in which data may have been > - * written to the server's write cache, but has not yet > - * been flushed to permanent storage. > - * Only move pages to writeback if this bdi is over its > - * threshold otherwise wait until the disk writes catch > - * up. 
> - */ > - trace_balance_dirty_start(bdi); > - if (bdi_nr_reclaimable > task_bdi_thresh) { > - pages_written += writeback_inodes_wb(&bdi->wb, > - write_chunk); > - trace_balance_dirty_written(bdi, pages_written); > - if (pages_written >= write_chunk) > - break; /* We've done our duty */ > + if (unlikely(!writeback_in_progress(bdi))) > + bdi_start_background_writeback(bdi); > + > + base_bw = bdi->dirty_ratelimit; > + bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty, > + bdi_thresh, bdi_dirty); > + if (unlikely(bw == 0)) { > + pause = MAX_PAUSE; > + goto pause; > } > + bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT; > + pause = (HZ * pages_dirtied + bw / 2) / (bw | 1); > + pause = min(pause, MAX_PAUSE); Fix this build warning: mm/page-writeback.c: In function ‘balance_dirty_pages’: mm/page-writeback.c:889:11: warning: comparison of distinct pointer types lacks a cast Signed-off-by: Andrea Righi <andrea@betterlinux.com> --- mm/page-writeback.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index a36f83d..a998931 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -886,7 +886,7 @@ static void balance_dirty_pages(struct address_space *mapping, } bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT; pause = (HZ * pages_dirtied + bw / 2) / (bw | 1); - pause = min(pause, MAX_PAUSE); + pause = min_t(unsigned long, pause, MAX_PAUSE); pause: __set_current_state(TASK_UNINTERRUPTIBLE); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-06 14:48 ` Andrea Righi (?) @ 2011-08-07 6:44 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-07 6:44 UTC (permalink / raw) To: Andrea Righi Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm, LKML > > + bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT; > > + pause = (HZ * pages_dirtied + bw / 2) / (bw | 1); > > + pause = min(pause, MAX_PAUSE); > > Fix this build warning: > > mm/page-writeback.c: In function ‘balance_dirty_pages’: > mm/page-writeback.c:889:11: warning: comparison of distinct pointer types lacks a cast Thanks! I'll fix it by changing `pause' to "long", since we'll have negative pause time anyway when considering think time compensation. Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-06 8:44 ` Wu Fengguang @ 2011-08-06 16:46 ` Andrea Righi -1 siblings, 0 replies; 301+ messages in thread From: Andrea Righi @ 2011-08-06 16:46 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm, LKML On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote: > As proposed by Chris, Dave and Jan, don't start foreground writeback IO > inside balance_dirty_pages(). Instead, simply let it idle sleep for some > time to throttle the dirtying task. In the meanwhile, kick off the > per-bdi flusher thread to do background writeback IO. > > RATIONALE > ========= > > - disk seeks on concurrent writeback of multiple inodes (Dave Chinner) > > If every thread doing writes and being throttled starts foreground > writeback, it leads to N IO submitters from at least N different > inodes at the same time, ending up with N different sets of IO being > issued with potentially zero locality to each other, resulting in > much lower elevator sort/merge efficiency; hence we seek the disk > all over the place to service the different sets of IO. > OTOH, if there is only one submission thread, it doesn't jump between > inodes in the same way when congestion clears - it keeps writing to > the same inode, resulting in large related chunks of sequential IOs > being issued to the disk. This is more efficient than the above > foreground writeback because the elevator works better and the disk > seeks less. > > - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner) > > With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes > from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
> > * "CPU usage has dropped by ~55%", "it certainly appears that most of > the CPU time saving comes from the removal of contention on the > inode_wb_list_lock" (IMHO at least 10% comes from the reduction of > cacheline bouncing, because the new code is able to call much less > frequently into balance_dirty_pages() and hence access the global > page states) > > * the user space "App overhead" is reduced by 20%, by avoiding the > cacheline pollution by the complex writeback code path > > * "for a ~5% throughput reduction", "the number of write IOs have > dropped by ~25%", and the elapsed time is reduced from 41:42.17 to > 40:53.23. > > * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, > and improves IO throughput from 38MB/s to 42MB/s. > > - IO size too small for fast arrays and too large for slow USB sticks > > The write_chunk used by the current balance_dirty_pages() cannot be > directly set to some large value (e.g. 128MB) for better IO efficiency, > because it could lead to user-perceivable stalls of more than 1 second. > Even the current 4MB write size may be too large for slow USB sticks. > The fact that balance_dirty_pages() starts IO by itself couples the > IO size to the wait time, which makes it hard to choose a suitable IO size > while keeping the wait time under control. > > Now it's possible to increase the writeback chunk size in proportion to the > disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB RAM, > the larger writeback size dramatically reduces the seek count to 1/10 > (far beyond my expectation) and improves the write throughput by 24%. > > - long block time in balance_dirty_pages() hurts desktop responsiveness > > Many of us may have had the experience: it often takes a couple of seconds > or even longer to stop a heavily writing dd/cp/tar command with > Ctrl-C or "kill -9".
> > - IO pipeline broken by bumpy write() progress > > There is a broad class of "loop {read(buf); write(buf);}" applications > whose read() pipeline will be under-utilized or even come to a stop if > the write()s have long latencies _or_ don't progress at a constant rate. > The current threshold-based throttling inherently transfers the large > low level IO completion fluctuations to bumpy application write()s, > and further deteriorates with an increasing number of dirtiers and/or bdi's. > > For example, when doing 50 dd's + 1 remote rsync to an XFS partition, > the rsync progresses very bumpily in the legacy kernel, and throughput is > improved by 67% by this patchset. (plus the larger write chunk size, > it becomes a 93% speedup). > > The new rate based throttling can support 1000+ dd's with excellent > smoothness, low latency and low overheads. > > For the above reasons, it's much better to do IO-less and low latency > pauses in balance_dirty_pages(). > > Jan Kara, Dave Chinner and I explored the scheme to let > balance_dirty_pages() wait for enough writeback IO completions to > safeguard the dirty limit. However, it was found to have two problems: > > - in large NUMA systems, the per-cpu counters may have big accounting > errors, leading to big throttle wait times and jitter. > > - NFS may kill a large amount of unstable pages with one single COMMIT. > Because the NFS server serves COMMIT with expensive fsync() IOs, it is > desirable to delay and reduce the number of COMMITs. So it's not > likely to optimize away such bursty IO completions, and the > resulting large (and tiny) stall times in IO completion based throttling. > > So here is a pause time oriented approach, which tries to control the > pause time in each balance_dirty_pages() invocation, by controlling > the number of pages dirtied before calling balance_dirty_pages(), for > smooth and efficient dirty throttling: > > - avoid useless (eg. 
zero pause time) balance_dirty_pages() calls > - avoid too small pause time (less than 4ms, which burns CPU power) > - avoid too large pause time (more than 200ms, which hurts responsiveness) > - avoid big fluctuations of pause times I definitely agree that too small pauses must be avoided. However, I don't understand very well from the code how the minimum sleep time is regulated. I've added a simple tracepoint (see below) to monitor the pause times in balance_dirty_pages(). Sometimes I see very small pause times if I set a low dirty threshold (<=32MB). Example: # echo $((16 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes # iozone -A >/dev/null & # cat /sys/kernel/debug/tracing/trace_pipe ... iozone-2075 [001] 380.604961: writeback_dirty_throttle: 1 iozone-2075 [001] 380.605966: writeback_dirty_throttle: 2 iozone-2075 [001] 380.608405: writeback_dirty_throttle: 0 iozone-2075 [001] 380.608980: writeback_dirty_throttle: 1 iozone-2075 [001] 380.609952: writeback_dirty_throttle: 1 iozone-2075 [001] 380.610952: writeback_dirty_throttle: 2 iozone-2075 [001] 380.612662: writeback_dirty_throttle: 0 iozone-2075 [000] 380.613799: writeback_dirty_throttle: 1 iozone-2075 [000] 380.614771: writeback_dirty_throttle: 1 iozone-2075 [000] 380.615767: writeback_dirty_throttle: 2 ... BTW, I can see this behavior only in the first minute while iozone is running. After ~1min things seem to get stable (sleeps are usually between 50ms and 200ms). I wonder if we shouldn't add an explicit check also for the minimum sleep time. 
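[Editorial note: for illustration, an explicit minimum-pause check of the kind suggested above could look like the userspace sketch below. The names `clamp_pause` and `MIN_PAUSE` are hypothetical (not taken from any patch in this thread); `MAX_PAUSE` and the 4ms/200ms bounds mirror the numbers quoted in the changelog. A pause below the floor is treated as "don't sleep at all this round", leaving the task to accumulate more dirtied pages before the next throttle point.]

```c
#include <assert.h>

#define HZ        1000            /* jiffies per second; 1ms tick assumed */
#define MAX_PAUSE (HZ / 5)        /* 200ms cap, per the changelog */
#define MIN_PAUSE (4 * HZ / 1000) /* hypothetical 4ms floor */

/*
 * Clamp a computed pause into [MIN_PAUSE, MAX_PAUSE].  Returning 0 for
 * sub-minimum pauses means "skip the sleep entirely"; the caller would
 * then let the task dirty more pages before throttling again, which
 * avoids the sub-4ms sleeps visible in the trace above.
 */
unsigned long clamp_pause(unsigned long pause)
{
	if (pause < MIN_PAUSE)
		return 0;
	if (pause > MAX_PAUSE)
		return MAX_PAUSE;
	return pause;
}
```

With these assumed constants, the 0-2ms sleeps in the trace would all be skipped, while a 1000-jiffy request would be capped at 200.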
Thanks, -Andrea Signed-off-by: Andrea Righi <andrea@betterlinux.com> --- include/trace/events/writeback.h | 12 ++++++++++++ mm/page-writeback.c | 1 + 2 files changed, 13 insertions(+), 0 deletions(-) diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 9c2cc8a..22b04b9 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -78,6 +78,18 @@ TRACE_EVENT(writeback_pages_written, TP_printk("%ld", __entry->pages) ); +TRACE_EVENT(writeback_dirty_throttle, + TP_PROTO(unsigned long sleep), + TP_ARGS(sleep), + TP_STRUCT__entry( + __field(unsigned long, sleep) + ), + TP_fast_assign( + __entry->sleep = sleep; + ), + TP_printk("%u", jiffies_to_msecs(__entry->sleep)) +); + DECLARE_EVENT_CLASS(writeback_class, TP_PROTO(struct backing_dev_info *bdi), TP_ARGS(bdi), diff --git a/mm/page-writeback.c b/mm/page-writeback.c index a998931..e5a2664 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -889,6 +889,7 @@ static void balance_dirty_pages(struct address_space *mapping, pause = min_t(unsigned long, pause, MAX_PAUSE); pause: + trace_writeback_dirty_throttle(pause); __set_current_state(TASK_UNINTERRUPTIBLE); io_schedule_timeout(pause); > > It can control pause times at will. The default policy will be to do > ~10ms pauses in the 1-dd case, and increase to ~100ms in the 1000-dd case. > > BEHAVIOR CHANGE > =============== > > (1) dirty threshold > > Users will notice that the applications will get throttled once crossing > the global (background + dirty)/2=15% threshold, and then balanced around > 17.5%. Before the patch, the behavior is to just throttle at 20% dirtyable > memory in the 1-dd case. > > Since the task will be soft throttled earlier than before, it may be > perceived by end users as a performance "slow down" if their application > happens to dirty more than 15% dirtyable memory. > > (2) smoothness/responsiveness > > Users will notice a more responsive system during heavy writeback. 
> "killall dd" will take effect instantly. > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > include/trace/events/writeback.h | 24 ---- > mm/page-writeback.c | 142 +++++++---------------------- > 2 files changed, 37 insertions(+), 129 deletions(-) > > --- linux-next.orig/mm/page-writeback.c 2011-08-06 11:17:26.000000000 +0800 > +++ linux-next/mm/page-writeback.c 2011-08-06 16:16:30.000000000 +0800 > @@ -242,50 +242,6 @@ static void bdi_writeout_fraction(struct > numerator, denominator); > } > > -static inline void task_dirties_fraction(struct task_struct *tsk, > - long *numerator, long *denominator) > -{ > - prop_fraction_single(&vm_dirties, &tsk->dirties, > - numerator, denominator); > -} > - > -/* > - * task_dirty_limit - scale down dirty throttling threshold for one task > - * > - * task specific dirty limit: > - * > - * dirty -= (dirty/8) * p_{t} > - * > - * To protect light/slow dirtying tasks from heavier/fast ones, we start > - * throttling individual tasks before reaching the bdi dirty limit. > - * Relatively low thresholds will be allocated to heavy dirtiers. So when > - * dirty pages grow large, heavy dirtiers will be throttled first, which will > - * effectively curb the growth of dirty pages. Light dirtiers with high enough > - * dirty threshold may never get throttled. 
> - */ > -#define TASK_LIMIT_FRACTION 8 > -static unsigned long task_dirty_limit(struct task_struct *tsk, > - unsigned long bdi_dirty) > -{ > - long numerator, denominator; > - unsigned long dirty = bdi_dirty; > - u64 inv = dirty / TASK_LIMIT_FRACTION; > - > - task_dirties_fraction(tsk, &numerator, &denominator); > - inv *= numerator; > - do_div(inv, denominator); > - > - dirty -= inv; > - > - return max(dirty, bdi_dirty/2); > -} > - > -/* Minimum limit for any task */ > -static unsigned long task_min_dirty_limit(unsigned long bdi_dirty) > -{ > - return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION; > -} > - > /* > * > */ > @@ -855,24 +811,28 @@ static unsigned long ratelimit_pages(uns > * perform some writeout. > */ > static void balance_dirty_pages(struct address_space *mapping, > - unsigned long write_chunk) > + unsigned long pages_dirtied) > { > - unsigned long nr_reclaimable, bdi_nr_reclaimable; > + unsigned long nr_reclaimable; > unsigned long nr_dirty; /* = file_dirty + writeback + unstable_nfs */ > unsigned long bdi_dirty; > unsigned long background_thresh; > unsigned long dirty_thresh; > unsigned long bdi_thresh; > - unsigned long task_bdi_thresh; > - unsigned long min_task_bdi_thresh; > - unsigned long pages_written = 0; > - unsigned long pause = 1; > + unsigned long pause = 0; > bool dirty_exceeded = false; > - bool clear_dirty_exceeded = true; > + unsigned long bw; > + unsigned long base_bw; > struct backing_dev_info *bdi = mapping->backing_dev_info; > unsigned long start_time = jiffies; > > for (;;) { > + /* > + * Unstable writes are a feature of certain networked > + * filesystems (i.e. NFS) in which data may have been > + * written to the server's write cache, but has not yet > + * been flushed to permanent storage. 
> + */ > nr_reclaimable = global_page_state(NR_FILE_DIRTY) + > global_page_state(NR_UNSTABLE_NFS); > nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); > @@ -888,8 +848,6 @@ static void balance_dirty_pages(struct a > break; > > bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); > - min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh); > - task_bdi_thresh = task_dirty_limit(current, bdi_thresh); > > /* > * In order to avoid the stacked BDI deadlock we need > @@ -901,56 +859,38 @@ static void balance_dirty_pages(struct a > * actually dirty; with m+n sitting in the percpu > * deltas. > */ > - if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) { > - bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE); > - bdi_dirty = bdi_nr_reclaimable + > + if (bdi_thresh < 2 * bdi_stat_error(bdi)) > + bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) + > bdi_stat_sum(bdi, BDI_WRITEBACK); > - } else { > - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); > - bdi_dirty = bdi_nr_reclaimable + > + else > + bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) + > bdi_stat(bdi, BDI_WRITEBACK); > - } > > - /* > - * The bdi thresh is somehow "soft" limit derived from the > - * global "hard" limit. The former helps to prevent heavy IO > - * bdi or process from holding back light ones; The latter is > - * the last resort safeguard. > - */ > - dirty_exceeded = (bdi_dirty > task_bdi_thresh) || > + dirty_exceeded = (bdi_dirty > bdi_thresh) || > (nr_dirty > dirty_thresh); > - clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) && > - (nr_dirty <= dirty_thresh); > - > - if (!dirty_exceeded) > - break; > - > - if (!bdi->dirty_exceeded) > + if (dirty_exceeded && !bdi->dirty_exceeded) > bdi->dirty_exceeded = 1; > > bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty, > bdi_thresh, bdi_dirty, start_time); > > - /* Note: nr_reclaimable denotes nr_dirty + nr_unstable. > - * Unstable writes are a feature of certain networked > - * filesystems (i.e. 
NFS) in which data may have been > - * written to the server's write cache, but has not yet > - * been flushed to permanent storage. > - * Only move pages to writeback if this bdi is over its > - * threshold otherwise wait until the disk writes catch > - * up. > - */ > - trace_balance_dirty_start(bdi); > - if (bdi_nr_reclaimable > task_bdi_thresh) { > - pages_written += writeback_inodes_wb(&bdi->wb, > - write_chunk); > - trace_balance_dirty_written(bdi, pages_written); > - if (pages_written >= write_chunk) > - break; /* We've done our duty */ > + if (unlikely(!writeback_in_progress(bdi))) > + bdi_start_background_writeback(bdi); > + > + base_bw = bdi->dirty_ratelimit; > + bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty, > + bdi_thresh, bdi_dirty); > + if (unlikely(bw == 0)) { > + pause = MAX_PAUSE; > + goto pause; > } > + bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT; > + pause = (HZ * pages_dirtied + bw / 2) / (bw | 1); > + pause = min(pause, MAX_PAUSE); > + > +pause: > __set_current_state(TASK_UNINTERRUPTIBLE); > io_schedule_timeout(pause); > - trace_balance_dirty_wait(bdi); > > dirty_thresh = hard_dirty_limit(dirty_thresh); > /* > @@ -960,8 +900,7 @@ static void balance_dirty_pages(struct a > * (b) the pause time limit makes the dirtiers more responsive. > */ > if (nr_dirty < dirty_thresh + > - dirty_thresh / DIRTY_MAXPAUSE_AREA && > - time_after(jiffies, start_time + MAX_PAUSE)) > + dirty_thresh / DIRTY_MAXPAUSE_AREA) > break; > /* > * pass-good area. When some bdi gets blocked (eg. NFS server > @@ -974,18 +913,9 @@ static void balance_dirty_pages(struct a > dirty_thresh / DIRTY_PASSGOOD_AREA && > bdi_dirty < bdi_thresh) > break; > - > - /* > - * Increase the delay for each loop, up to our previous > - * default of taking a 100ms nap. 
> - */ > - pause <<= 1; > - if (pause > HZ / 10) > - pause = HZ / 10; > } > > - /* Clear dirty_exceeded flag only when no task can exceed the limit */ > - if (clear_dirty_exceeded && bdi->dirty_exceeded) > + if (!dirty_exceeded && bdi->dirty_exceeded) > bdi->dirty_exceeded = 0; > > current->nr_dirtied = 0; > @@ -1002,8 +932,10 @@ static void balance_dirty_pages(struct a > * In normal mode, we start background writeout at the lower > * background_thresh, to keep the amount of dirty memory low. > */ > - if ((laptop_mode && pages_written) || > - (!laptop_mode && (nr_reclaimable > background_thresh))) > + if (laptop_mode) > + return; > + > + if (nr_reclaimable > background_thresh) > bdi_start_background_writeback(bdi); > } > > --- linux-next.orig/include/trace/events/writeback.h 2011-08-06 11:08:34.000000000 +0800 > +++ linux-next/include/trace/events/writeback.h 2011-08-06 11:17:29.000000000 +0800 > @@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg > DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister); > DEFINE_WRITEBACK_EVENT(writeback_thread_start); > DEFINE_WRITEBACK_EVENT(writeback_thread_stop); > -DEFINE_WRITEBACK_EVENT(balance_dirty_start); > -DEFINE_WRITEBACK_EVENT(balance_dirty_wait); > - > -TRACE_EVENT(balance_dirty_written, > - > - TP_PROTO(struct backing_dev_info *bdi, int written), > - > - TP_ARGS(bdi, written), > - > - TP_STRUCT__entry( > - __array(char, name, 32) > - __field(int, written) > - ), > - > - TP_fast_assign( > - strncpy(__entry->name, dev_name(bdi->dev), 32); > - __entry->written = written; > - ), > - > - TP_printk("bdi %s written %d", > - __entry->name, > - __entry->written > - ) > -); > > DECLARE_EVENT_CLASS(wbc_class, > TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), > ^ permalink raw reply related [flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-06 16:46 ` Andrea Righi (?) @ 2011-08-07 7:18 ` Wu Fengguang 2011-08-07 9:50 ` Andrea Righi -1 siblings, 1 reply; 301+ messages in thread From: Wu Fengguang @ 2011-08-07 7:18 UTC (permalink / raw) To: Andrea Righi Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm, LKML [-- Attachment #1: Type: text/plain, Size: 3482 bytes --] Andrea, On Sun, Aug 07, 2011 at 12:46:56AM +0800, Andrea Righi wrote: > On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote: > > So here is a pause time oriented approach, which tries to control the > > pause time in each balance_dirty_pages() invocations, by controlling > > the number of pages dirtied before calling balance_dirty_pages(), for > > smooth and efficient dirty throttling: > > > > - avoid useless (eg. zero pause time) balance_dirty_pages() calls > > - avoid too small pause time (less than 4ms, which burns CPU power) > > - avoid too large pause time (more than 200ms, which hurts responsiveness) > > - avoid big fluctuations of pause times > > I definitely agree that too small pauses must be avoided. However, I > don't understand very well from the code how the minimum sleep time is > regulated. Thanks for pointing this out. Yes, the sleep time regulation is not here and I should have mentioned that above. Since this is only the core bits, there will be some followup patches to fix the rough edges. (attached the two relevant patches) > I've added a simple tracepoint (see below) to monitor the pause times in > balance_dirty_pages(). > > Sometimes I see very small pause time if I set a low dirty threshold > (<=32MB). Yeah, it's definitely possible. > Example: > > # echo $((16 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes > # iozone -A >/dev/null & > # cat /sys/kernel/debug/tracing/trace_pipe > ... 
> iozone-2075 [001] 380.604961: writeback_dirty_throttle: 1 > iozone-2075 [001] 380.605966: writeback_dirty_throttle: 2 > iozone-2075 [001] 380.608405: writeback_dirty_throttle: 0 > iozone-2075 [001] 380.608980: writeback_dirty_throttle: 1 > iozone-2075 [001] 380.609952: writeback_dirty_throttle: 1 > iozone-2075 [001] 380.610952: writeback_dirty_throttle: 2 > iozone-2075 [001] 380.612662: writeback_dirty_throttle: 0 > iozone-2075 [000] 380.613799: writeback_dirty_throttle: 1 > iozone-2075 [000] 380.614771: writeback_dirty_throttle: 1 > iozone-2075 [000] 380.615767: writeback_dirty_throttle: 2 > ... > > BTW, I can see this behavior only in the first minute while iozone is > running. Ater ~1min things seem to get stable (sleeps are usually > between 50ms and 200ms). > Yeah, it's roughly in line with this graph, where the red dots are the pause time: http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/512M/xfs-1dd-4k-8p-438M-20:10-3.0.0-next-20110802+-2011-08-06.11:03/balance_dirty_pages-pause.png Note that the big change of pattern in the middle is due to a deliberate disturb: a dd will be started at 100s _reading_ 1GB data, which effectively livelocked the other dd dirtier task with the CFQ io scheduler. > I wonder if we shouldn't add an explicit check also for the minimum > sleep time. With the more complete patchset including the pause time regulation, the pause time distribution should look much better, falling nicely into the range (5ms, 20ms): http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/xfs-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:51/balance_dirty_pages-pause.png > +TRACE_EVENT(writeback_dirty_throttle, > + TP_PROTO(unsigned long sleep), > + TP_ARGS(sleep), btw, I've just pushed two more tracing patches to the git tree. 
Hope it helps :) Thanks, Fengguang [-- Attachment #2: max-pause --] [-- Type: text/plain, Size: 3065 bytes --] Subject: writeback: limit max dirty pause time Date: Sat Jun 11 19:21:43 CST 2011 Apply two policies to scale down the max pause time for 1) small number of concurrent dirtiers 2) small memory system (comparing to storage bandwidth) MAX_PAUSE=200ms may only be suitable for high end servers with lots of concurrent dirtiers, where the large pause time can reduce much overheads. Otherwise, smaller pause time is desirable whenever possible, so as to get good responsiveness and smooth user experiences. It's actually required for good disk utilization in the case when all the dirty pages can be synced to disk within MAX_PAUSE=200ms. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- mm/page-writeback.c | 43 ++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 41 insertions(+), 2 deletions(-) --- linux-next.orig/mm/page-writeback.c 2011-08-07 14:23:45.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-08-07 14:25:29.000000000 +0800 @@ -856,6 +856,42 @@ static unsigned long ratelimit_pages(uns return 1; } +static unsigned long bdi_max_pause(struct backing_dev_info *bdi, + unsigned long bdi_dirty) +{ + unsigned long hi = ilog2(bdi->write_bandwidth); + unsigned long lo = ilog2(bdi->dirty_ratelimit); + unsigned long t; + + /* target for ~10ms pause on 1-dd case */ + t = HZ / 50; + + /* + * Scale up pause time for concurrent dirtiers in order to reduce CPU + * overheads. + * + * (N * 20ms) on 2^N concurrent tasks. + */ + if (hi > lo) + t += (hi - lo) * (20 * HZ) / 1024; + + /* + * Limit pause time for small memory systems. If sleeping for too long + * time, a small pool of dirty/writeback pages may go empty and disk go + * idle. + * + * 1ms for every 1MB; may further consider bdi bandwidth. 
+ */ + if (bdi_dirty) + t = min(t, bdi_dirty >> (30 - PAGE_CACHE_SHIFT - ilog2(HZ))); + + /* + * The pause time will be settled within range (max_pause/4, max_pause). + * Apply a minimal value of 4 to get a non-zero max_pause/4. + */ + return clamp_val(t, 4, MAX_PAUSE); +} + /* * balance_dirty_pages() must be called by processes which are generating dirty * data. It looks at the number of dirty pages in the machine and will force @@ -873,6 +909,7 @@ static void balance_dirty_pages(struct a unsigned long dirty_thresh; unsigned long bdi_thresh; long pause = 0; + long max_pause; bool dirty_exceeded = false; unsigned long bw; unsigned long base_bw; @@ -930,16 +967,18 @@ static void balance_dirty_pages(struct a if (unlikely(!writeback_in_progress(bdi))) bdi_start_background_writeback(bdi); + max_pause = bdi_max_pause(bdi, bdi_dirty); + base_bw = bdi->dirty_ratelimit; bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty, bdi_thresh, bdi_dirty); if (unlikely(bw == 0)) { - pause = MAX_PAUSE; + pause = max_pause; goto pause; } bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT; pause = (HZ * pages_dirtied + bw / 2) / (bw | 1); - pause = min(pause, MAX_PAUSE); + pause = min(pause, max_pause); pause: trace_balance_dirty_pages(bdi, [-- Attachment #3: max-pause-adaption --] [-- Type: text/plain, Size: 1829 bytes --] Subject: writeback: control dirty pause time Date: Sat Jun 11 19:32:32 CST 2011 The dirty pause time shall ultimately be controlled by adjusting nr_dirtied_pause, since there is relationship pause = pages_dirtied / pos_bw Assuming pages_dirtied ~= nr_dirtied_pause pos_bw ~= base_bw We get nr_dirtied_pause ~= base_bw * desired_pause Here base_bw is preferred over pos_bw because it's more stable. 
It's also important to limit possible large transitional errors: - bw is changing quickly - pages_dirtied << nr_dirtied_pause on entering dirty exceeded area - pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a separate fix, but still expect non-trivial errors) So we end up using the above formula inside clamp_val(). The best test case for this code is to run 100 "dd bs=4M" tasks on btrfs and check its pause time distribution. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- mm/page-writeback.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) --- linux-next.orig/mm/page-writeback.c 2011-08-07 14:51:18.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-08-07 15:02:08.000000000 +0800 @@ -1021,7 +1021,19 @@ pause: bdi->dirty_exceeded = 0; current->nr_dirtied = 0; - current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh); + if (pause == 0) + current->nr_dirtied_pause = + ratelimit_pages(nr_dirty, dirty_thresh); + else if (pause < max_pause / 4) + current->nr_dirtied_pause = clamp_val( + base_bw * (max_pause/2) / HZ, + pages_dirtied + pages_dirtied/8, + pages_dirtied * 4); + else if (pause > max_pause) + current->nr_dirtied_pause = 1 | clamp_val( + base_bw * (max_pause*3/8) / HZ, + current->nr_dirtied_pause / 4, + current->nr_dirtied_pause*7/8); if (writeback_in_progress(bdi)) return; ^ permalink raw reply [flat|nested] 301+ messages in thread
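The clamping done in the `pause < max_pause / 4` branch of the second attached patch can be modeled in plain userspace C. This is only an illustrative sketch: it assumes HZ = 1000, uses a helper name of our own, and ignores the kernel's per-task state.

```c
#include <assert.h>

#define HZ 1000

/*
 * Userspace model (illustrative only) of the nr_dirtied_pause update
 * above: aim for a pause near max_pause/2 by dirtying about
 * base_bw * (max_pause/2) / HZ pages per period, clamped so the next
 * value never jumps too far away from the pages just dirtied.
 */
static unsigned long next_nr_dirtied_pause(unsigned long base_bw,   /* pages/s */
					   long max_pause,          /* jiffies */
					   unsigned long pages_dirtied)
{
	unsigned long target = base_bw * (max_pause / 2) / HZ;
	unsigned long lo = pages_dirtied + pages_dirtied / 8;
	unsigned long hi = pages_dirtied * 4;

	if (target < lo)
		return lo;
	if (target > hi)
		return hi;
	return target;
}
```

For example, at 100MB/s (25600 pages/s with 4k pages) and max_pause = HZ/5, the target is 2560 pages per period; a task that just dirtied only 100 pages is stepped up gradually (to 400) rather than jumping straight to the target, which keeps transitional errors bounded.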
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-07 7:18 ` Wu Fengguang @ 2011-08-07 9:50 ` Andrea Righi 0 siblings, 0 replies; 301+ messages in thread From: Andrea Righi @ 2011-08-07 9:50 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm, LKML On Sun, Aug 07, 2011 at 03:18:57PM +0800, Wu Fengguang wrote: > Andrea, > > On Sun, Aug 07, 2011 at 12:46:56AM +0800, Andrea Righi wrote: > > On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote: > > > > So here is a pause time oriented approach, which tries to control the > > > pause time in each balance_dirty_pages() invocations, by controlling > > > the number of pages dirtied before calling balance_dirty_pages(), for > > > smooth and efficient dirty throttling: > > > > > > - avoid useless (eg. zero pause time) balance_dirty_pages() calls > > > - avoid too small pause time (less than 4ms, which burns CPU power) > > > - avoid too large pause time (more than 200ms, which hurts responsiveness) > > > - avoid big fluctuations of pause times > > > > I definitely agree that too small pauses must be avoided. However, I > > don't understand very well from the code how the minimum sleep time is > > regulated. > > Thanks for pointing this out. Yes, the sleep time regulation is not > here and I should have mentioned that above. Since this is only the > core bits, there will be some followup patches to fix the rough edges. > (attached the two relevant patches) > > > I've added a simple tracepoint (see below) to monitor the pause times in > > balance_dirty_pages(). > > > > Sometimes I see very small pause time if I set a low dirty threshold > > (<=32MB). > > Yeah, it's definitely possible. > > > Example: > > > > # echo $((16 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes > > # iozone -A >/dev/null & > > # cat /sys/kernel/debug/tracing/trace_pipe > > ... 
> > iozone-2075 [001] 380.604961: writeback_dirty_throttle: 1 > > iozone-2075 [001] 380.605966: writeback_dirty_throttle: 2 > > iozone-2075 [001] 380.608405: writeback_dirty_throttle: 0 > > iozone-2075 [001] 380.608980: writeback_dirty_throttle: 1 > > iozone-2075 [001] 380.609952: writeback_dirty_throttle: 1 > > iozone-2075 [001] 380.610952: writeback_dirty_throttle: 2 > > iozone-2075 [001] 380.612662: writeback_dirty_throttle: 0 > > iozone-2075 [000] 380.613799: writeback_dirty_throttle: 1 > > iozone-2075 [000] 380.614771: writeback_dirty_throttle: 1 > > iozone-2075 [000] 380.615767: writeback_dirty_throttle: 2 > > ... > > > > BTW, I can see this behavior only in the first minute while iozone is > > running. Ater ~1min things seem to get stable (sleeps are usually > > between 50ms and 200ms). > > > > Yeah, it's roughly in line with this graph, where the red dots are the > pause time: > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/512M/xfs-1dd-4k-8p-438M-20:10-3.0.0-next-20110802+-2011-08-06.11:03/balance_dirty_pages-pause.png > > Note that the big change of pattern in the middle is due to a > deliberate disturb: a dd will be started at 100s _reading_ 1GB data, > which effectively livelocked the other dd dirtier task with the CFQ io > scheduler. > > > I wonder if we shouldn't add an explicit check also for the minimum > > sleep time. > > With the more complete patchset including the pause time regulation, > the pause time distribution should look much better, falling nicely > into the range (5ms, 20ms): > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/xfs-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:51/balance_dirty_pages-pause.png > > > +TRACE_EVENT(writeback_dirty_throttle, > > + TP_PROTO(unsigned long sleep), > > + TP_ARGS(sleep), > > btw, I've just pushed two more tracing patches to the git tree. > Hope it helps :) Perfect. 
Thanks for the clarification and the additional patches, I'm going to test them right now. -Andrea ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-06 8:44 ` Wu Fengguang @ 2011-08-09 18:15 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-09 18:15 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote: [..] > - trace_balance_dirty_start(bdi); > - if (bdi_nr_reclaimable > task_bdi_thresh) { > - pages_written += writeback_inodes_wb(&bdi->wb, > - write_chunk); > - trace_balance_dirty_written(bdi, pages_written); > - if (pages_written >= write_chunk) > - break; /* We've done our duty */ > + if (unlikely(!writeback_in_progress(bdi))) > + bdi_start_background_writeback(bdi); > + > + base_bw = bdi->dirty_ratelimit; > + bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty, > + bdi_thresh, bdi_dirty); For the sake of consistency of variable naming, how about using pos_ratio = bdi_position_ratio()? > + if (unlikely(bw == 0)) { > + pause = MAX_PAUSE; > + goto pause; > } > + bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT; So far bw had pos_ratio as value; now it will be replaced with actual bandwidth as value. It makes the code confusing. So using pos_ratio will help: bw = (u64)base_bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; Thanks Vivek ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-09 18:15 ` Vivek Goyal (?) @ 2011-08-09 18:41 ` Peter Zijlstra -1 siblings, 0 replies; 301+ messages in thread From: Peter Zijlstra @ 2011-08-09 18:41 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Tue, 2011-08-09 at 14:15 -0400, Vivek Goyal wrote: > > So far bw had pos_ratio as value now it will be replaced with actual > bandwidth as value. It makes code confusing. So using pos_ratio will > help. Agreed on consistency, also I'm not sure bandwidth is the right term here to begin with, it's a pages/s unit and I think rate would be better here. But whatever ;-) ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-09 18:41 ` Peter Zijlstra @ 2011-08-10 3:22 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-10 3:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Wed, Aug 10, 2011 at 02:41:05AM +0800, Peter Zijlstra wrote: > On Tue, 2011-08-09 at 14:15 -0400, Vivek Goyal wrote: > > > > So far bw had pos_ratio as value now it will be replaced with actual > > bandwidth as value. It makes code confusing. So using pos_ratio will > > help. > > Agreed on consistency, also I'm not sure bandwidth is the right term > here to begin with, its a pages/s unit and I think rate would be better > here. But whatever ;-) Good idea, I'll switch to the name "rate". Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-09 18:15 ` Vivek Goyal @ 2011-08-10 3:26 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-10 3:26 UTC (permalink / raw) To: Vivek Goyal Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Wed, Aug 10, 2011 at 02:15:43AM +0800, Vivek Goyal wrote: > On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote: > > [..] > > - trace_balance_dirty_start(bdi); > > - if (bdi_nr_reclaimable > task_bdi_thresh) { > > - pages_written += writeback_inodes_wb(&bdi->wb, > > - write_chunk); > > - trace_balance_dirty_written(bdi, pages_written); > > - if (pages_written >= write_chunk) > > - break; /* We've done our duty */ > > + if (unlikely(!writeback_in_progress(bdi))) > > + bdi_start_background_writeback(bdi); > > + > > + base_bw = bdi->dirty_ratelimit; > > + bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty, > > + bdi_thresh, bdi_dirty); > > For the sake of consistency of usage of varibale naming how about using > > pos_ratio = bdi_position_ratio()? OK! > > + if (unlikely(bw == 0)) { > > + pause = MAX_PAUSE; > > + goto pause; > > } > > + bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT; > > So far bw had pos_ratio as value now it will be replaced with actual > bandwidth as value. It makes code confusing. So using pos_ratio will > help. > > bw = (u64)base_bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; Yeah it makes good sense. I'll change to. rate = (u64)base_rate * pos_ratio >> BANDWIDTH_CALC_SHIFT; Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
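The fixed-point multiply in that last line can be sketched in userspace C. This is only an illustration: BANDWIDTH_CALC_SHIFT = 10 is an assumption here (so a pos_ratio of 1 << 10 represents 1.0), and the helper name is our own.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for illustration; the kernel patch defines its own. */
#define BANDWIDTH_CALC_SHIFT 10

/*
 * Fixed-point scaling as in the quoted line: pos_ratio is a fraction
 * with BANDWIDTH_CALC_SHIFT fractional bits, so (1 << 10) means 1.0.
 * The widening cast to 64 bits avoids overflow of the intermediate
 * product on 32-bit systems.
 */
static unsigned long scale_rate(unsigned long base_rate,
				unsigned long pos_ratio)
{
	return (unsigned long)(((uint64_t)base_rate * pos_ratio)
			       >> BANDWIDTH_CALC_SHIFT);
}
```

So a base rate of 25600 pages/s with pos_ratio = 512 (0.5) throttles the task to 12800 pages/s, while pos_ratio = 1 << 10 leaves it unchanged.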
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-06 8:44 ` Wu Fengguang @ 2011-08-09 19:16 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-09 19:16 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote: [..] > -/* > - * task_dirty_limit - scale down dirty throttling threshold for one task > - * > - * task specific dirty limit: > - * > - * dirty -= (dirty/8) * p_{t} > - * > - * To protect light/slow dirtying tasks from heavier/fast ones, we start > - * throttling individual tasks before reaching the bdi dirty limit. > - * Relatively low thresholds will be allocated to heavy dirtiers. So when > - * dirty pages grow large, heavy dirtiers will be throttled first, which will > - * effectively curb the growth of dirty pages. Light dirtiers with high enough > - * dirty threshold may never get throttled. > - */ Hi Fengguang, So we have got rid of the notion of a per-task dirty limit based on each task's fraction? What replaces it? I can't see any code which replaces it. If yes, I am wondering how you get fairness among tasks which share this bdi. Also wondering what this patch series did to make sure that tasks share the bdi more fairly and get write_bw/N bandwidth. Thanks Vivek ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-09 19:16 ` Vivek Goyal @ 2011-08-10 4:33 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-10 4:33 UTC (permalink / raw) To: Vivek Goyal Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML [-- Attachment #1: Type: text/plain, Size: 3749 bytes --] On Wed, Aug 10, 2011 at 03:16:22AM +0800, Vivek Goyal wrote: > On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote: > > [..] > > -/* > > - * task_dirty_limit - scale down dirty throttling threshold for one task > > - * > > - * task specific dirty limit: > > - * > > - * dirty -= (dirty/8) * p_{t} > > - * > > - * To protect light/slow dirtying tasks from heavier/fast ones, we start > > - * throttling individual tasks before reaching the bdi dirty limit. > > - * Relatively low thresholds will be allocated to heavy dirtiers. So when > > - * dirty pages grow large, heavy dirtiers will be throttled first, which will > > - * effectively curb the growth of dirty pages. Light dirtiers with high enough > > - * dirty threshold may never get throttled. > > - */ > > Hi Fengguang, > > So we have got rid of the notion of per task dirty limit based on their > fraction? What replaces it. It's simply removed :) > I can't see any code which is replacing it. The think time compensation feature (patch attached) will provide the same protection for light/slow dirtiers. With it, the slower dirtiers won't be throttled at all, because the pause time, calculated as period = pages_dirtied / rate; pause = period - think, will be <= 0.
For example, given write_bw = 100MB/s and - 2 dd tasks that dirty pages as fast as possible - 1 scp whose dirty rate is limited by network bandwidth 10MB/s Then with think time compensation, the real dirty rates will be - 2 dd tasks: (100-10)/2 = 45MB/s (each) - 1 scp task: 10MB/s The scp task won't be throttled by balance_dirty_pages() any more. This is a tested feature. In the below graph, the dirty rates (the slopes of the lines) of the last 3 tasks are 2, 4, 8 MB/s http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/btrfs-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-45/balance_dirty_pages-task-bw.png given this fio workload, which started one full speed dirtier and four rate limited dirtiers at 1, 2, 4, 8 MB/s http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/btrfs-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-45/fio-rates > If yes, I am wondering how > do you get fairness among tasks which share this bdi. > > Also wondering what did this patch series to do make sure that tasks > share bdi more fairly and get write_bw/N bandwidth. Each of the N dd tasks will be rate limited by rate = base_rate * pos_ratio At any time snapshot, each bdi task will see almost the same base_rate and pos_ratio, so will be throttled at almost the same rate. This is a strong guarantee of fairness under all situations. Since pos_ratio is fluctuating (evenly) around 1.0, and base_rate=bdi->dirty_ratelimit is fluctuating around (write_bw/N), on average we get avg_rate = (write_bw/N) * 1.0 (I'll explain the "dirty_ratelimit = write_bw/N" magic in other emails.) The below graphs demonstrate the dirty progress of the last 3 dd tasks. The slope of each curve is the dirty rate.
They vividly show three curves progressing at the same pace in all of the 3 stages - rampup stage (20-100s) - disturbed stage (120s-160s) (disturbed by starting a 1GB read dd in the middle of the tests) - stable stage (after 160s) And they dirtied almost the same amount of pages during the test. http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/8G/xfs-10dd-4k-32p-6802M-20:10-3.0.0-next-20110802+-2011-08-06.16:26/balance_dirty_pages-task-bw.png http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/2G/xfs-10dd-4k-8p-1947M-20:10-3.0.0-next-20110802+-2011-08-06.15:49/balance_dirty_pages-task-bw.png Thanks, Fengguang [-- Attachment #2: think-time-compensation --] [-- Type: text/plain, Size: 5083 bytes --] Subject: writeback: dirty ratelimit - think time compensation Date: Sat Jun 11 19:25:42 CST 2011 Compensate the task's think time when computing the final pause time, so that ->dirty_ratelimit can be executed accurately. In the rare case that the task slept longer than the period time (resulting in a negative pause time), the extra sleep time will be compensated in the next period if it's not too big (<500ms). Accumulated errors are carefully avoided as long as the max pause area is not hit.
Pseudo code:

	period = pages_dirtied / bw;
	think = jiffies - dirty_paused_when;
	pause = period - think;

case 1: period > think

	pause = period - think
	dirty_paused_when += pause

	                 period time
	      |======================================>|
	              think time
	      |===============>|
	------|----------------|----------------------|-----------
	  dirty_paused_when                        jiffies

case 2: period <= think

	don't pause; reduce future pause time by:
	dirty_paused_when += period

	                 period time
	      |=========================>|
	                  think time
	      |======================================>|
	------|--------------------------+------------|-----------
	  dirty_paused_when                       jiffies

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    1 +
 kernel/fork.c         |    1 +
 mm/page-writeback.c   |   34 +++++++++++++++++++++++++++++++---
 3 files changed, 33 insertions(+), 3 deletions(-)

--- linux-next.orig/include/linux/sched.h	2011-08-09 07:53:31.000000000 +0800
+++ linux-next/include/linux/sched.h	2011-08-09 07:54:12.000000000 +0800
@@ -1531,6 +1531,7 @@ struct task_struct {
 	 */
 	int nr_dirtied;
 	int nr_dirtied_pause;
+	unsigned long dirty_paused_when; /* start of a write-and-pause period */
 
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
--- linux-next.orig/mm/page-writeback.c	2011-08-09 07:53:31.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-09 08:08:11.000000000 +0800
@@ -817,6 +817,7 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	long period;
 	long pause = 0;
 	bool dirty_exceeded = false;
 	unsigned long bw;
@@ -825,6 +826,8 @@ static void balance_dirty_pages(struct a
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		unsigned long now = jiffies;
+
 		/*
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been
@@ -842,8 +845,11 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= (background_thresh + dirty_thresh) / 2) {
+			current->dirty_paused_when = now;
+			current->nr_dirtied = 0;
 			break;
+		}
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
 
@@ -879,17 +885,40 @@ static void balance_dirty_pages(struct a
 		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
 					bdi_thresh, bdi_dirty);
 		if (unlikely(bw == 0)) {
+			period = MAX_PAUSE;
 			pause = MAX_PAUSE;
 			goto pause;
 		}
 		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
-		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
+		period = (HZ * pages_dirtied + bw / 2) / (bw | 1);
+		pause = current->dirty_paused_when + period - now;
+		/*
+		 * For less than 1s think time (ext3/4 may block the dirtier
+		 * for up to 800ms from time to time on 1-HDD; so does xfs,
+		 * however at much less frequency), try to compensate it in
+		 * future periods by updating the virtual time; otherwise just
+		 * do a reset, as it may be a light dirtier.
+		 */
+		if (unlikely(pause <= 0)) {
+			if (pause < -HZ) {
+				current->dirty_paused_when = now;
+				current->nr_dirtied = 0;
+			} else if (period) {
+				current->dirty_paused_when += period;
+				current->nr_dirtied = 0;
+			}
+			pause = 1; /* avoid resetting nr_dirtied_pause below */
+			break;
+		}
 		pause = min(pause, MAX_PAUSE);
 
pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
+		current->dirty_paused_when = now + pause;
+		current->nr_dirtied = 0;
+
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
 		 * max-pause area. If dirty exceeded but still within this
@@ -916,7 +945,6 @@ pause:
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
-	current->nr_dirtied = 0;
 	current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh);
 
 	if (writeback_in_progress(bdi))
--- linux-next.orig/kernel/fork.c	2011-08-09 07:53:31.000000000 +0800
+++ linux-next/kernel/fork.c	2011-08-09 07:54:12.000000000 +0800
@@ -1303,6 +1303,7 @@ static struct task_struct *copy_process(
 
 	p->nr_dirtied = 0;
 	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
+	p->dirty_paused_when = 0;
 
 	/*
 	 * Ok, make it visible to the rest of the system.

^ permalink raw reply [flat|nested] 301+ messages in thread
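The pause arithmetic in the patch above can be modelled in userspace Python (an illustrative sketch only: the HZ and MAX_PAUSE values are assumptions, and the kernel additionally clamps the pause to 1 jiffy to preserve nr_dirtied_pause):

```python
HZ = 100                # assumed jiffies per second
MAX_PAUSE = HZ // 5     # assumed 200ms cap on a single pause

def next_pause(pages_dirtied, rate, now, dirty_paused_when):
    """Return (pause, new_dirty_paused_when), all in jiffies.

    period: how long dirtying `pages_dirtied` pages should take at the
            target `rate` (pages/second).
    pause:  the period minus the task's think time since the last pause,
            so a light dirtier that "thinks" longer than its period is
            not throttled at all.
    """
    period = (HZ * pages_dirtied + rate // 2) // (rate | 1)
    pause = dirty_paused_when + period - now
    if pause <= 0:
        if pause < -HZ:
            # slept far longer than one period: reset the virtual time
            return 0, now
        # moderate think time: carry it into future periods
        return 0, dirty_paused_when + period
    pause = min(pause, MAX_PAUSE)
    return pause, now + pause

# A full-speed dirtier with zero think time sleeps for the whole period:
print(next_pause(8, 64, now=1000, dirty_paused_when=1000))   # (12, 1012)

# A light dirtier that "thought" for 60 jiffies is not paused at all:
print(next_pause(8, 64, now=1060, dirty_paused_when=1000))   # (0, 1012)
```

Note how the second call advances the virtual time by one period rather than resetting it, which is the accumulated-error avoidance the changelog describes.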
* Re: [PATCH 0/5] IO-less dirty throttling v8 2011-08-06 8:44 ` Wu Fengguang @ 2011-08-09 2:01 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-09 2:01 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Sat, Aug 06, 2011 at 04:44:47PM +0800, Wu Fengguang wrote: > Hi all, > > The _core_ bits of the IO-less balance_dirty_pages(). > Heavily simplified and re-commented to make it easier to review. > > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v8 > > Only the bare minimal algorithms are presented, so you will find some rough > edges in the graphs below. But it's usable :) > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/ > > And an introduction to the (more complete) algorithms: > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf > > Questions and reviews are highly appreciated! Hi Wu, I am going through slide number 39 where you talk about it being future proof and how it can be used for IO control purposes. You have listed the following merits of this approach. * per-bdi nature, works on NFS and Software RAID * no delayed response (working at the right layer) * no page tracking, hence decoupled from memcg * no interactions with FS and CFQ * get proportional IO controller for free * reuse/inherit all the base facilities/functions I would say that it would also be a good idea to list the demerits of this approach in its current form, namely that it only deals with controlling buffered write IO and nothing else. So on the same block device, other direct writes might be going on from the same group, and in this scheme a user will not have any control over them. Another disadvantage is that throttling at page cache level does not take care of IO spikes at device level.
Now I think one could probably come up with a more sophisticated scheme where throttling is done at the bdi level but is also accounted at the device level by the IO controller. (Something similar I had done in the past, but Dave Chinner did not like it.) Anyway, keeping track of per cgroup rate and throttling accordingly can definitely help implement an algorithm for per cgroup IO control. We probably just need to find a reasonable way to account all this IO to the end device so that we have control of all kinds of IO of a cgroup. How do you implement proportional control here? Do you regularly vary per cgroup bandwidth, based on cgroup weight, out of the overall bdi bandwidth? Again, the issue here is that it controls only buffered WRITES and nothing else, and in this case co-ordinating with CFQ will probably be hard. So I guess proportional IO control just for buffered WRITES will have limited use. Thanks Vivek > > shortlog: > > Wu Fengguang (5): > writeback: account per-bdi accumulated dirtied pages > writeback: dirty position control > writeback: dirty rate control > writeback: per task dirty rate limit > writeback: IO-less balance_dirty_pages() > > The last 4 patches are one single logical change, but splitted here to > make it easier to review the different parts of the algorithm. > > diffstat: > > include/linux/backing-dev.h | 8 + > include/linux/sched.h | 7 + > include/trace/events/writeback.h | 24 -- > mm/backing-dev.c | 3 + > mm/memory_hotplug.c | 3 - > mm/page-writeback.c | 459 ++++++++++++++++++++++---------------- > 6 files changed, 290 insertions(+), 214 deletions(-) > > Thanks, > Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
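The proportional split being asked about can be pictured with a trivial sketch (a hypothetical helper for illustration; nothing in the patch set implements this): derive per-cgroup buffered-write budgets from the measured bdi bandwidth and cgroup weights.

```python
def cgroup_write_budgets(bdi_write_bw, weights):
    """Divide the bdi's measured writeback bandwidth (MB/s) among
    cgroups in proportion to their weights.  Purely illustrative of
    the "vary per cgroup bandwidth based on cgroup weight" idea."""
    total = sum(weights.values())
    return {cg: bdi_write_bw * w / total for cg, w in weights.items()}

# 100 MB/s device, weights 100 vs 300 -> 25 MB/s and 75 MB/s budgets
budgets = cgroup_write_budgets(100.0, {"light": 100, "heavy": 300})
```

Each cgroup's budget would then play the role that write_bw plays for the whole bdi today; the open question raised above is how such a split would coordinate with direct IO and CFQ.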
* Re: [PATCH 0/5] IO-less dirty throttling v8 2011-08-09 2:01 ` Vivek Goyal @ 2011-08-09 5:55 ` Dave Chinner -1 siblings, 0 replies; 301+ messages in thread From: Dave Chinner @ 2011-08-09 5:55 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Mon, Aug 08, 2011 at 10:01:27PM -0400, Vivek Goyal wrote: > On Sat, Aug 06, 2011 at 04:44:47PM +0800, Wu Fengguang wrote: > > Hi all, > > > > The _core_ bits of the IO-less balance_dirty_pages(). > > Heavily simplified and re-commented to make it easier to review. > > > > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v8 > > > > Only the bare minimal algorithms are presented, so you will find some rough > > edges in the graphs below. But it's usable :) > > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/ > > > > And an introduction to the (more complete) algorithms: > > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf > > > > Questions and reviews are highly appreciated! > > Hi Wu, > > I am going through the slide number 39 where you talk about it being > future proof and it can be used for IO control purposes. You have listed > following merits of this approach. > > * per-bdi nature, works on NFS and Software RAID > * no delayed response (working at the right layer) > * no page tracking, hence decoupled from memcg > * no interactions with FS and CFQ > * get proportional IO controller for free > * reuse/inherit all the base facilities/functions > > I would say that it will also be a good idea to list the demerits of > this approach in current form and that is that it only deals with > controlling buffered write IO and nothing else. That's not a demerit - that is all it is designed to do. 
> So on the same block device, other direct writes might be going on > from same group and in this scheme a user will not have any > control. But it is taken into account by the IO write throttling. > Another disadvantage is that throttling at page cache > level does not take care of IO spikes at device level. And that is handled as well. How? By the indirect effect other IO and IO spikes have on the writeback rate. That is, other IO reduces the writeback bandwidth, which then changes the throttling parameters via feedback loops. The buffered write throttle is designed to reduce the page cache dirtying rate to the current cleaning rate of the backing device. Increase the cleaning rate (i.e. device is otherwise idle) and it will throttle less. Decrease the cleaning rate (i.e. other IO spikes or the block IO throttle activates) and it will throttle more. We have to vary buffered write throttling like this to adapt to changing IO workloads (e.g. someone starting a read-heavy workload will slow down the writeback rate, so we need to throttle buffered writes more aggressively), so it has to be independent of any sort of block layer IO controller. Simply put: the block IO controller still has direct control over the rate at which buffered writes drain out of the system. The IO-less write throttle simply limits the rate at which buffered writes come into the system to match whatever the IO path allows to drain out.... > Now I think one could probably come up with more sophisticated scheme > where throttling is done at bdi level but is also accounted at device > level at IO controller. (Something similar I had done in the past but > Dave Chinner did not like it). I don't like it because it is a solution to a specific problem and requires complex coupling across multiple layers of the system. We are trying to move away from that throttling model.
More fundamentally, though, it is not a general solution to the entire class of "IO writeback rate changed" problems that buffered write throttling needs to solve. > Anyway, keeping track of per cgroup rate and throttling accordingly > can definitely help implement an algorithm for per cgroup IO control. > We probably just need to find a reasonable way to account all this > IO to end device so that we have control of all kind of IO of a cgroup. > How do you implement proportional control here? From overall bdi bandwidth > vary per cgroup bandwidth regularly based on cgroup weight? Again the > issue here is that it controls only buffered WRITES and nothing else and > in this case co-ordinating with CFQ will probably be hard. So I guess > usage of proportional IO just for buffered WRITES will have limited > usage. The whole point of doing the throttling this way is that we don't need any sort of special connection between block IO throttling and page cache (buffered write) throttling. We significantly reduce the coupling between the layers by relying on feedback-driven control loops to determine the buffered write throttling thresholds adaptively. IOWs, the IO-less write throttling at the page cache will adjust automatically to whatever throughput the block IO throttling allows async writes to achieve. However, before we have a "finished product", there is still another piece of the puzzle to be put in place - memcg-aware buffered writeback. That is, having a flusher thread do work on behalf of memcg in the IO context of the memcg. Then the IO controller just sees a stream of async writes in the context of the memcg the buffered writes came from in the first place. The block layer throttles them just like any other IO in the IO context of the memcg... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 301+ messages in thread
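The adaptive behaviour Dave describes (the dirty throttle tracking whatever rate the IO path currently sustains) can be illustrated with a toy feedback loop; the gain constant and the structure are assumptions chosen for clarity, not the kernel's actual bandwidth estimator:

```python
def update_ratelimit(ratelimit, write_bw, n_dirtiers, gain=0.5):
    """Nudge the per-task dirty ratelimit toward its share of the
    measured writeback bandwidth.  Because write_bw is *measured*,
    read spikes or block-layer throttling lower it, and the dirtier
    rate follows with no explicit coupling between the layers."""
    target = write_bw / max(n_dirtiers, 1)
    return ratelimit + gain * (target - ratelimit)

# Two dirtiers on a device sustaining 100 MB/s: converges to 50 MB/s each.
r = 10.0
for _ in range(20):
    r = update_ratelimit(r, write_bw=100.0, n_dirtiers=2)

# A read-heavy workload halves the measured writeback bandwidth;
# the limit re-converges to 25 MB/s with no change anywhere else.
for _ in range(20):
    r = update_ratelimit(r, write_bw=50.0, n_dirtiers=2)
```

This is the sense in which the block IO controller keeps direct control: it shapes the drain rate, the measured bandwidth reflects that, and the fill rate follows.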
* Re: [PATCH 0/5] IO-less dirty throttling v8 2011-08-09 5:55 ` Dave Chinner @ 2011-08-09 14:04 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-09 14:04 UTC (permalink / raw) To: Dave Chinner Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Tue, Aug 09, 2011 at 03:55:51PM +1000, Dave Chinner wrote: > On Mon, Aug 08, 2011 at 10:01:27PM -0400, Vivek Goyal wrote: > > On Sat, Aug 06, 2011 at 04:44:47PM +0800, Wu Fengguang wrote: > > > Hi all, > > > > > > The _core_ bits of the IO-less balance_dirty_pages(). > > > Heavily simplified and re-commented to make it easier to review. > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v8 > > > > > > Only the bare minimal algorithms are presented, so you will find some rough > > > edges in the graphs below. But it's usable :) > > > > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/ > > > > > > And an introduction to the (more complete) algorithms: > > > > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf > > > > > > Questions and reviews are highly appreciated! > > > > Hi Wu, > > > > I am going through the slide number 39 where you talk about it being > > future proof and it can be used for IO control purposes. You have listed > > following merits of this approach. > > > > * per-bdi nature, works on NFS and Software RAID > > * no delayed response (working at the right layer) > > * no page tracking, hence decoupled from memcg > > * no interactions with FS and CFQ > > * get proportional IO controller for free > > * reuse/inherit all the base facilities/functions > > > > I would say that it will also be a good idea to list the demerits of > > this approach in current form and that is that it only deals with > > controlling buffered write IO and nothing else. 
> > That's not a demerit - that is all it is designed to do. It is designed to improve the existing task throttling functionality, and we are trying to extend the same to cgroups too. So if by design something does not gel well with existing pieces, it is a demerit to me. At least there should be a good explanation of the design intention and how it is going to be useful. For example, how is this thing going to gel with the existing IO controller? Are you going to create two separate mechanisms: one for control of writes while entering the cache and the other for controlling the writes at device level? The fact that this mechanism does not know about any other IO in the system/cgroup is a limiting factor. From a usability point of view, a user expects control of any kind of IO happening from a group. So are we planning to create a new controller? Or add additional files in the existing controller to control the per cgroup write throttling behavior? Even if we create additional files, a user is then forced to put separate write policies in place for buffered writes and direct writes. I was hoping a better interface would be that the user puts a policy on writes and that takes effect, and the user does not have to worry whether the applications inside the cgroup are doing buffered writes or direct writes. > > > So on the same block device, other direct writes might be going on > > from same group and in this scheme a user will not have any > > control. > > But it is taken into account by the IO write throttling. You mean the blkio controller? It does. But my complaint is that we are trying to control two separate knobs for two kinds of IO, and I am trying to come up with a single knob. The current interface for write control in the blkio controller looks like blkio.throtl.write_bps_device. One can write to this file specifying the write limit of a cgroup on a particular device.
I was hoping that buffered write limits would come out of the same limit, but with these patches it looks like we shall have to create a new interface altogether which just controls buffered writes and nothing else, and the user is supposed to know what his application is doing and try to configure the limits accordingly. So my concern is how the overall interface would look, how well it will work with the existing controller, and how a user is supposed to use it. In fact, the current IO controller does throttling at the device level, so the interface is device specific. One is supposed to know the major and minor number of the device to specify limits. I am not sure in this case what one is supposed to do, as it is bdi specific and in the NFS case there is no device. So is one supposed to specify a bdi, or are limits going to be global (system wide, independent of bdi or block device)? > > > Another disadvantage is that throttling at page cache > > level does not take care of IO spikes at device level. > > And that is handled as well. > > How? By the indirect effect other IO and IO spikes have on the > writeback rate. That is, other IO reduces the writeback bandwidth, > which then changes the throttling parameters via feedback loops. Actually I was referring to the effect of buffered writes on other IO going on at the device. With control being at the device level, one can tightly control the WRITEs flowing out of a cgroup to the LUN, and that can help a bit in knowing how bad it will be for other reads going on at the LUN. With this scheme, flusher threads can suddenly throw tons of writes at the LUN and then no IO for another few seconds. So basically IO is bursty at the device level, and doing control at the device level can make it smoother. So we have two ways to control buffered writes. - Throttle them while entering the page cache - Throttle them at the device, with a feedback loop that in turn throttles them at the page cache level based on the dirty ratio.
Andrea and I had implemented the first approach (the same thing Wu is suggesting now, with a different mechanism), and the following was your response.

https://lkml.org/lkml/2011/6/28/494

To me it looked like at that point in time you preferred precise throttling at the device level, and now you seem to prefer precise throttling at the page cache level?

Again, I am not against cgroup-parameter-based throttling at the page cache level. It simplifies the implementation and probably is good enough for lots of people. I am only worried about the interface and how it works with the existing interfaces.

In absolute throttling one does not have to care about feedback or what the underlying bdi bandwidth is. So to me these patches are good for work-conserving IO control, where we want to determine how fast we can write to the device and then throttle tasks accordingly. But in absolute throttling one specifies the upper limit, and there we don't need a mechanism to determine the bdi bandwidth or how many dirty pages there are in order to throttle tasks.

> The buffered write throttle is designed to reduce the page cache
> dirtying rate to the current cleaning rate of the backing device.
> Increase the cleaning rate (i.e. device is otherwise idle) and
> it will throttle less. Decrease the cleaning rate (i.e. other IO
> spikes or block IO throttle activates) and it will throttle more.
>
> We have to vary buffered write throttling like this to adapt to
> changing IO workloads (e.g. someone starting a read-heavy workload
> will slow down writeback rate, so we need to throttle buffered
> writes more aggressively), so it has to be independent of any sort
> of block layer IO controller.
>
> Simply put: the block IO controller still has direct control over
> the rate at which buffered writes drain out of the system. The
> IO-less write throttle simply limits the rate at which buffered
> writes come into the system to match whatever the IO path allows to
> drain out....
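The behaviour Dave describes, matching the cache dirtying rate to the device's cleaning rate while throttling harder as dirty pages rise, is a proportional feedback loop. A toy simulation with made-up numbers and a linear position factor (the actual patches use a more elaborate control law, so treat this purely as an illustration of the principle):

```python
def simulate(write_rate, clean_rate, limit, setpoint, steps, dt=0.01):
    """Track the dirty page count when dirtiers are rate-limited to
    clean_rate scaled by a position factor: above 1.0 below the
    setpoint, exactly 1.0 at the setpoint, 0.0 at the dirty limit."""
    dirty = 0.0
    history = []
    for _ in range(steps):
        pos = max(0.0, (limit - dirty) / (limit - setpoint))
        allowed = clean_rate * pos              # pages/s dirtiers may dirty
        dirtied = min(write_rate, allowed) * dt
        cleaned = min(dirty, clean_rate * dt)   # device cleans what it can
        dirty += dirtied - cleaned
        history.append(dirty)
    return history

# A dirtier wanting 2000 pages/s against a device cleaning only 1000 pages/s
# settles at the setpoint instead of climbing to the dirty limit.
history = simulate(write_rate=2000, clean_rate=1000,
                   limit=1000, setpoint=500, steps=1000)
```

The point of the example is the adaptivity: halve `clean_rate` (other IO, or a block-layer throttle kicking in) and the equilibrium dirtying rate halves with it, with no explicit coupling between the two layers.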
Ok, this makes sense. So it goes back to the previous design, where absolute cgroup-based control happens at the device level and the IO-less throttle implements the feedback loop to slow down the writes into the page cache. That makes sense. But Wu's slides suggest that one can directly implement cgroup-based IO control in the IO-less throttling, and that's where I have concerns.

Anyway, this stuff will have to be made cgroup aware, so that tasks of different groups can see different throttling depending on how much IO their group is able to do at the device level.

> > Now I think one could probably come up with more sophisticated scheme
> > where throttling is done at bdi level but is also accounted at device
> > level at IO controller. (Something similar I had done in the past but
> > Dave Chinner did not like it).
>
> I don't like it because it is a solution to a specific problem and
> requires complex coupling across multiple layers of the system. We
> are trying to move away from that throttling model. More
> fundamentally, though, it is not a general solution to the
> entire class of "IO writeback rate changed" problems that buffered
> write throttling needs to solve.
>
> > Anyway, keeping track of per cgroup rate and throttling accordingly
> > can definitely help implement an algorithm for per cgroup IO control.
> > We probably just need to find a reasonable way to account all this
> > IO to end device so that we have control of all kind of IO of a cgroup.
> > How do you implement proportional control here? From overall bdi bandwidth
> > vary per cgroup bandwidth regularly based on cgroup weight? Again the
> > issue here is that it controls only buffered WRITES and nothing else and
> > in this case co-ordinating with CFQ will probably be hard. So I guess
> > usage of proportional IO just for buffered WRITES will have limited
> > usage.
> The whole point of doing the throttling this way is that we don't
> need any sort of special connection between block IO throttling and
> page cache (buffered write) throttling. We significantly reduce the
> coupling between the layers by relying on feedback-driven control
> loops to determine the buffered write throttling thresholds
> adaptively. IOWs, the IO-less write throttling at the page cache
> will adjust automatically to whatever throughput the block IO
> throttling allows async writes to achieve.

This is good. But that's not the impression one gets from Wu's slides.

> However, before we have a "finished product", there is still another
> piece of the puzzle to be put in place - memcg-aware buffered
> writeback. That is, having a flusher thread do work on behalf of
> memcg in the IO context of the memcg. Then the IO controller just
> sees a stream of async writes in the context of the memcg the
> buffered writes came from in the first place. The block layer
> throttles them just like any other IO in the IO context of the
> memcg...

Yes, that is still a missing piece. I was hoping that Greg Thelen would be able to extend his patches to submit writes in the context of per-cgroup flusher/worker threads and solve this problem.

Thanks
Vivek

^ permalink raw reply [flat|nested] 301+ messages in thread
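The proportional-control question Vivek raises in the exchange above (carve per-cgroup write bandwidth out of the measured bdi writeback bandwidth, by cgroup weight) is, at its simplest, plain arithmetic. A hypothetical helper, not an existing kernel interface:

```python
def split_write_bandwidth(bdi_write_bw, weights):
    """Divide the bdi's measured writeback bandwidth (pages/s) into
    per-cgroup dirty ratelimits proportional to cgroup weight.
    Illustrative only; names and units are assumptions."""
    total = sum(weights.values())
    return {cg: bdi_write_bw * w / total for cg, w in weights.items()}

# 1000 pages/s of bdi bandwidth split 1:3 between two made-up groups.
limits = split_write_bandwidth(1000, {"grp_a": 1, "grp_b": 3})
```

The hard part Vivek points at is not this arithmetic but the policy around it: the split controls only buffered writes, and keeping it consistent with what CFQ does for the rest of the IO is the open issue.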
* Re: [PATCH 0/5] IO-less dirty throttling v8
  2011-08-09 14:04 ` Vivek Goyal
@ 2011-08-10 7:41 ` Greg Thelen
  -1 siblings, 0 replies; 301+ messages in thread
From: Greg Thelen @ 2011-08-10 7:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Minchan Kim, Wu Fengguang, Dave Chinner, Christoph Hellwig, LKML,
      Andrea Righi, Andrew Morton, linux-fsdevel, linux-mm, Jan Kara,
      KAMEZAWA Hiroyuki

On Aug 9, 2011 7:04 AM, "Vivek Goyal" <vgoyal@redhat.com> wrote:
>
> On Tue, Aug 09, 2011 at 03:55:51PM +1000, Dave Chinner wrote:
> > On Mon, Aug 08, 2011 at 10:01:27PM -0400, Vivek Goyal wrote:
> > > On Sat, Aug 06, 2011 at 04:44:47PM +0800, Wu Fengguang wrote:
> > > > Hi all,
> > > >
> > > > The _core_ bits of the IO-less balance_dirty_pages().
> > > > Heavily simplified and re-commented to make it easier to review.
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v8
> > > >
> > > > Only the bare minimal algorithms are presented, so you will find some rough
> > > > edges in the graphs below. But it's usable :)
> > > >
> > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/
> > > >
> > > > And an introduction to the (more complete) algorithms:
> > > >
> > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf
> > > >
> > > > Questions and reviews are highly appreciated!
> > >
> > > Hi Wu,
> > >
> > > I am going through slide number 39, where you talk about it being
> > > future proof and usable for IO control purposes. You have listed the
> > > following merits of this approach.
> > >
> > > * per-bdi nature, works on NFS and Software RAID
> > > * no delayed response (working at the right layer)
> > > * no page tracking, hence decoupled from memcg
> > > * no interactions with FS and CFQ
> > > * get proportional IO controller for free
> > > * reuse/inherit all the base facilities/functions
> > >
> > > I would say that it will also be a good idea to list the demerits of
> > > this approach in current form and that is that it only deals with
> > > controlling buffered write IO and nothing else.
> >
> > That's not a demerit - that is all it is designed to do.
>
> It is designed to improve the existing task throttling functionality, and
> we are trying to extend the same to cgroups too. So if by design something
> does not gel well with the existing pieces, that is a demerit to me. At least
> there should be a good explanation of the design intention and how it is
> going to be useful.
>
> For example, how is this thing going to gel with the existing IO controller?
> Are you going to create two separate mechanisms: one for controlling
> writes as they enter the cache, and another for controlling writes
> at the device level?
>
> The fact that this mechanism does not know about any other IO in the
> system/cgroup is a limiting factor. From a usability point of view, a
> user expects to control any kind of IO happening from a group.
>
> So are we planning to create a new controller? Or add additional files
> to the existing controller to control the per-cgroup write throttling
> behavior? Even if we create additional files, a user is then
> forced to set separate write policies for buffered writes and direct
> writes. I was hoping for a better interface where the user puts a
> policy on writes and it takes effect, without having to
> worry whether the applications inside the cgroup are doing buffered
> writes or direct writes.
> > >
> > > So on the same block device, other direct writes might be going on
> > > from same group and in this scheme a user will not have any
> > > control.
> >
> > But it is taken into account by the IO write throttling.
>
> You mean the blkio controller?
>
> It does. But my complaint is that we are trying to control two separate
> knobs for two kinds of IO, and I am trying to come up with a single
> knob.
>
> The current interface for write control in the blkio controller looks like:
>
>   blkio.throttle.write_bps_device
>
> One can write to this file to specify the write limit of a cgroup
> on a particular device. I was hoping that buffered write limits
> would come out of the same limit, but with these patches it looks like we
> shall have to create a new interface altogether which just controls
> buffered writes and nothing else, and the user is supposed to know what
> his application is doing and configure the limits accordingly.
>
> So my concern is how the overall interface will look,
> how well it will work with the existing controller, and how a user is
> supposed to use it.
>
> In fact the current IO controller does throttling at the device level, so the
> interface is device specific. One is supposed to know the major
> and minor number of the device to specify. I am not sure in this
> case what one is supposed to do, as it is bdi specific and in the
> NFS case there is no device. So is one supposed to specify a bdi, or
> are limits going to be global (system wide, independent of bdi
> or block device)?
>
> > > Another disadvantage is that throttling at page cache
> > > level does not take care of IO spikes at device level.
> >
> > And that is handled as well.
> >
> > How? By the indirect effect other IO and IO spikes have on the
> > writeback rate. That is, other IO reduces the writeback bandwidth,
> > which then changes the throttling parameters via feedback loops.
>
> Actually I was referring to the effect of buffered writes on the other IO
> going on the device.
> With control being at the device level, one can
> tightly control the WRITEs flowing out of a cgroup to a LUN, and that
> helps in knowing how bad it will be for other reads going
> on the LUN.
>
> With this scheme, flusher threads can suddenly throw tons of writes
> at the LUN and then issue no IO for another few seconds. So basically IO is
> bursty at the device level, and doing control at the device level can make
> it smoother.
>
> So we have two ways to control buffered writes.
>
> - Throttle them while entering the page cache
> - Throttle them at the device, and a feedback loop in turn throttles them at
>   the page cache level based on the dirty ratio
>
> Andrea and I had implemented the first approach (the same thing Wu is
> suggesting now, with a different mechanism), and the following was your
> response.
>
> https://lkml.org/lkml/2011/6/28/494
>
> To me it looked like at that point in time you preferred precise
> throttling at the device level, and now you seem to prefer precise throttling
> at the page cache level?
>
> Again, I am not against cgroup-parameter-based throttling at the page
> cache level. It simplifies the implementation and probably is good
> enough for lots of people. I am only worried about the interface
> and how it works with the existing interfaces.
>
> In absolute throttling one does not have to care about feedback or
> what the underlying bdi bandwidth is. So to me these patches are
> good for work-conserving IO control, where we want to determine how
> fast we can write to the device and then throttle tasks accordingly. But
> in absolute throttling one specifies the upper limit, and there we
> don't need a mechanism to determine the bdi bandwidth or
> how many dirty pages there are in order to throttle tasks.
>
> > The buffered write throttle is designed to reduce the page cache
> > dirtying rate to the current cleaning rate of the backing device.
> > Increase the cleaning rate (i.e. device is otherwise idle) and
> > it will throttle less.
> > Decrease the cleaning rate (i.e. other IO
> > spikes or block IO throttle activates) and it will throttle more.
> >
> > We have to vary buffered write throttling like this to adapt to
> > changing IO workloads (e.g. someone starting a read-heavy workload
> > will slow down writeback rate, so we need to throttle buffered
> > writes more aggressively), so it has to be independent of any sort
> > of block layer IO controller.
> >
> > Simply put: the block IO controller still has direct control over
> > the rate at which buffered writes drain out of the system. The
> > IO-less write throttle simply limits the rate at which buffered
> > writes come into the system to match whatever the IO path allows to
> > drain out....
>
> Ok, this makes sense. So it goes back to the previous design, where
> absolute cgroup-based control happens at the device level and the IO-less
> throttle implements the feedback loop to slow down the writes into the
> page cache. That makes sense. But Wu's slides suggest that one can
> directly implement cgroup-based IO control in the IO-less throttling,
> and that's where I have concerns.
>
> Anyway, this stuff will have to be made cgroup aware, so that tasks
> of different groups can see different throttling depending on how
> much IO their group is able to do at the device level.
>
> > > Now I think one could probably come up with more sophisticated scheme
> > > where throttling is done at bdi level but is also accounted at device
> > > level at IO controller. (Something similar I had done in the past but
> > > Dave Chinner did not like it).
> >
> > I don't like it because it is a solution to a specific problem and
> > requires complex coupling across multiple layers of the system. We
> > are trying to move away from that throttling model. More
> > fundamentally, though, it is not a general solution to the
> > entire class of "IO writeback rate changed" problems that buffered
> > write throttling needs to solve.
> > >
> > > Anyway, keeping track of per cgroup rate and throttling accordingly
> > > can definitely help implement an algorithm for per cgroup IO control.
> > > We probably just need to find a reasonable way to account all this
> > > IO to end device so that we have control of all kind of IO of a cgroup.
> > > How do you implement proportional control here? From overall bdi bandwidth
> > > vary per cgroup bandwidth regularly based on cgroup weight? Again the
> > > issue here is that it controls only buffered WRITES and nothing else and
> > > in this case co-ordinating with CFQ will probably be hard. So I guess
> > > usage of proportional IO just for buffered WRITES will have limited
> > > usage.
> >
> > The whole point of doing the throttling this way is that we don't
> > need any sort of special connection between block IO throttling and
> > page cache (buffered write) throttling. We significantly reduce the
> > coupling between the layers by relying on feedback-driven control
> > loops to determine the buffered write throttling thresholds
> > adaptively. IOWs, the IO-less write throttling at the page cache
> > will adjust automatically to whatever throughput the block IO
> > throttling allows async writes to achieve.
>
> This is good. But that's not the impression one gets from Wu's slides.
>
> > However, before we have a "finished product", there is still another
> > piece of the puzzle to be put in place - memcg-aware buffered
> > writeback. That is, having a flusher thread do work on behalf of
> > memcg in the IO context of the memcg. Then the IO controller just
> > sees a stream of async writes in the context of the memcg the
> > buffered writes came from in the first place. The block layer
> > throttles them just like any other IO in the IO context of the
> > memcg...
>
> Yes, that is still a missing piece.
> I was hoping that Greg Thelen would be able to extend his patches to
> submit writes in the context of per-cgroup flusher/worker threads and
> solve this problem.
>
> Thanks
> Vivek

Are you suggesting multiple flushers per bdi (one per cgroup)? I thought the point of IO-less was to issue buffered writes from a single thread.

Note: I have rebased the memcg writeback code to the latest mmotm and am testing it now. These patches do not introduce additional threads; the existing bdi flusher threads are used with an optional memcg filter.

^ permalink raw reply [flat|nested] 301+ messages in thread
> > > > > Anyway, keeping track of per cgroup rate and throttling accordingly > > > can definitely help implement an algorithm for per cgroup IO control. > > > We probably just need to find a reasonable way to account all this > > > IO to end device so that we have control of all kind of IO of a cgroup. > > > How do you implement proportional control here? From overall bdi bandwidth > > > vary per cgroup bandwidth regularly based on cgroup weight? Again the > > > issue here is that it controls only buffered WRITES and nothing else and > > > in this case co-ordinating with CFQ will probably be hard. So I guess > > > usage of proportional IO just for buffered WRITES will have limited > > > usage. > > > > The whole point of doing the throttling this way is that we don't > > need any sort of special connection between block IO throttling and > > page cache (buffered write) throttling. We significantly reduce the > > coupling between the layers by relying on feedback-driven control > > loops to determine the buffered write throttling thresholds > > adaptively. IOWs, the IO-less write throttling at the page cache > > will adjust automatically to whatever throughput the block IO > > throttling allows async writes to achieve. > > This is good. But that's not the impression one gets from Wu's slides. > > > > > However, before we have a "finished product", there is still another > > piece of the puzzle to be put in place - memcg-aware buffered > > writeback. That is, having a flusher thread do work on behalf of > > memcg in the IO context of the memcg. Then the IO controller just > > sees a stream of async writes in the context of the memcg the > > buffered writes came from in the first place. The block layer > > throttles them just like any other IO in the IO context of the > > memcg... > > Yes that is still a piece remaining. 
I was hoping that Greg Thelen will > be able to extend his patches to submit writes in the context of > per cgroup flusher/worker threads and solve this problem. > > Thanks > Vivek Are you suggesting multiple flushers per bdi (one per cgroup)? I thought the point of IO less was to one issue buffered writes from a single thread. Note: I have rebased the memcg writeback code to latest mmotm and am testing it now. These patches do not introduce additional threads; the existing bdi flusher threads are used with an optional memcg filter. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 0/5] IO-less dirty throttling v8 @ 2011-08-10 7:41 ` Greg Thelen 0 siblings, 0 replies; 301+ messages in thread From: Greg Thelen @ 2011-08-10 7:41 UTC (permalink / raw) To: Vivek Goyal Cc: Minchan Kim, Wu Fengguang, Dave Chinner, Christoph Hellwig, LKML, Andrea Righi, Andrew Morton, linux-fsdevel, linux-mm, Jan Kara, KAMEZAWA Hiroyuki On Aug 9, 2011 7:04 AM, "Vivek Goyal" <vgoyal@redhat.com> wrote: > > On Tue, Aug 09, 2011 at 03:55:51PM +1000, Dave Chinner wrote: > > On Mon, Aug 08, 2011 at 10:01:27PM -0400, Vivek Goyal wrote: > > > On Sat, Aug 06, 2011 at 04:44:47PM +0800, Wu Fengguang wrote: > > > > Hi all, > > > > > > > > The _core_ bits of the IO-less balance_dirty_pages(). > > > > Heavily simplified and re-commented to make it easier to review. > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v8 > > > > > > > > Only the bare minimal algorithms are presented, so you will find some rough > > > > edges in the graphs below. But it's usable :) > > > > > > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/ > > > > > > > > And an introduction to the (more complete) algorithms: > > > > > > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf > > > > > > > > Questions and reviews are highly appreciated! > > > > > > Hi Wu, > > > > > > I am going through slide number 39, where you talk about it being > > > future proof and usable for IO control purposes. You have listed the > > > following merits of this approach. 
> > > > > > * per-bdi nature, works on NFS and Software RAID > > > * no delayed response (working at the right layer) > > > * no page tracking, hence decoupled from memcg > > > * no interactions with FS and CFQ > > > * get proportional IO controller for free > > > * reuse/inherit all the base facilities/functions > > > > > > I would say that it will also be a good idea to list the demerits of > > > this approach in its current form, and that is that it only deals with > > > controlling buffered write IO and nothing else. > > > > That's not a demerit - that is all it is designed to do. > > It is designed to improve the existing task throttling functionality and > we are trying to extend the same to cgroups too. So if by design something > does not gel well with existing pieces, it is a demerit to me. At least > there should be a good explanation of the design intention and how it is > going to be useful. > > For example, how is this thing going to gel with the existing IO controller? > Are you going to create two separate mechanisms: one for control of > writes while entering the cache and the other for controlling the writes > at the device level? > > The fact that this mechanism does not know about any other IO in the > system/cgroup is a limiting factor. From a usability point of view, a > user expects to control any kind of IO happening from a group. > > So are we planning to create a new controller? Or add additional files > in the existing controller to control the per-cgroup write throttling > behavior? Even if we create additional files, a user is then > forced to put separate write policies for buffered writes and direct > writes. I was hoping a better interface would be that the user puts a > policy on writes and that takes effect, and the user does not have to > worry whether the applications inside the cgroup are doing buffered > writes or direct writes. 
> > > > > > So on the same block device, other direct writes might be going on > > > from the same group and in this scheme a user will not have any > > > control. > > > > But it is taken into account by the IO write throttling. > > You mean the blkio controller? > > It does. But my complaint is that we are trying to control two separate > knobs for two kinds of IO, and I am trying to come up with a single > knob. > > The current interface for write control in the blkio controller looks like:

blkio.throtl.write_bps_device

One can write to this file specifying the write limit of a cgroup > on a particular device. I was hoping that buffered write limits > would come out of the same limit, but with these patches it looks like we > shall have to create a new interface altogether which just controls > buffered writes and nothing else, and the user is supposed to know what > his application is doing and try to configure the limits accordingly. > > So my concern is how the overall interface would look, how > well it will work with the existing controller, and how a user is > supposed to use it. > > In fact the current IO controller does throttling at the device level, so the > interface is device specific. One is supposed to know the major > and minor number of the device to specify. I am not sure in this > case what one is supposed to do, as it is bdi specific and in the > NFS case there is no device. So is one supposed to specify a bdi, or > are the limits going to be global (system wide, independent of bdi > or block device)? > > > > > > Another disadvantage is that throttling at page cache > > > level does not take care of IO spikes at device level. > > > > And that is handled as well. > > > > How? By the indirect effect other IO and IO spikes have on the > > writeback rate. That is, other IO reduces the writeback bandwidth, > > which then changes the throttling parameters via feedback loops. > > Actually I was referring to the effect of buffered writes on other IO > going on at the device. 
With control at the device level, one can > tightly control the WRITEs flowing out of a cgroup to the LUN, and that > helps a bit in knowing how bad it will be for other reads going on > at the LUN. > > With this scheme, flusher threads can suddenly throw tons of writes > at the LUN and then no IO for another few seconds. So basically IO is > bursty at the device level, and doing control at the device level can make > it smoother. > > So we have two ways to control buffered writes:

- Throttle them while entering the page cache
- Throttle them at the device, and let the feedback loop in turn throttle them at the page cache level based on dirty ratio

Andrea and I had implemented the first approach (the same as what Wu is > suggesting now, with a different mechanism) and the following was your > response. > > https://lkml.org/lkml/2011/6/28/494 > > To me it looked like at that point in time you preferred precise > throttling at the device level, and now you seem to prefer precise throttling > at the page cache level? > > Again, I am not against cgroup-parameter-based throttling at the page > cache level. It simplifies the implementation and is probably good > enough for lots of people. I am only worried about the interface > and how it works with existing interfaces. > > In absolute throttling one does not have to care about feedback or > what the underlying bdi bandwidth is. So to me these patches are > good for work-conserving IO control, where we want to determine how > fast we can write to the device and then throttle tasks accordingly. But > in absolute throttling one specifies the upper limit, and there we > don't need the mechanism that determines the bdi bandwidth or > how many dirty pages there are and throttles tasks accordingly. > > > > The buffered write throttle is designed to reduce the page cache > > dirtying rate to the current cleaning rate of the backing > > device. Increase the cleaning rate (i.e. the device is otherwise idle) and > > it will throttle less. 
Decrease the cleaning rate (i.e. other IO > > spikes or the block IO throttle activates) and it will throttle more. > > > > We have to vary buffered write throttling like this to adapt to > > changing IO workloads (e.g. someone starting a read-heavy workload > > will slow down the writeback rate, so we need to throttle buffered > > writes more aggressively), so it has to be independent of any sort > > of block layer IO controller. > > > > Simply put: the block IO controller still has direct control over > > the rate at which buffered writes drain out of the system. The > > IO-less write throttle simply limits the rate at which buffered > > writes come into the system to match whatever the IO path allows to > > drain out.... > > Ok, this makes sense. So it goes back to the previous design where > absolute cgroup-based control happens at the device level and the IO-less > throttle implements the feedback loop to slow down the writes into the > page cache. That makes sense. But Wu's slides suggest that one can > directly implement cgroup-based IO control in IO-less throttling, > and that's where I have concerns. > > Anyway this stuff shall have to be made cgroup aware so that tasks > of different groups can see different throttling depending on how > much IO that group is able to do at the device level. > > > > > Now I think one could probably come up with a more sophisticated scheme > > > where throttling is done at the bdi level but is also accounted at the device > > > level in the IO controller. (Something similar I had done in the past but > > > Dave Chinner did not like it.) > > > > I don't like it because it is a solution to a specific problem and > > requires complex coupling across multiple layers of the system. We > > are trying to move away from that throttling model. More > > fundamentally, though, it is not a general solution to the > > entire class of "IO writeback rate changed" problems that buffered > > write throttling needs to solve. 
> > > > > Anyway, keeping track of per-cgroup rates and throttling accordingly > > > can definitely help implement an algorithm for per-cgroup IO control. > > > We probably just need to find a reasonable way to account all this > > > IO to the end device so that we have control of all kinds of IO of a cgroup. > > > How do you implement proportional control here? From the overall bdi bandwidth, > > > vary per-cgroup bandwidth regularly based on cgroup weight? Again the > > > issue here is that it controls only buffered WRITES and nothing else, and > > > in this case co-ordinating with CFQ will probably be hard. So I guess > > > proportional IO just for buffered WRITES will be of limited > > > use. > > > > The whole point of doing the throttling this way is that we don't > > need any sort of special connection between block IO throttling and > > page cache (buffered write) throttling. We significantly reduce the > > coupling between the layers by relying on feedback-driven control > > loops to determine the buffered write throttling thresholds > > adaptively. IOWs, the IO-less write throttling at the page cache > > will adjust automatically to whatever throughput the block IO > > throttling allows async writes to achieve. > > This is good. But that's not the impression one gets from Wu's slides. > > > > > However, before we have a "finished product", there is still another > > piece of the puzzle to be put in place - memcg-aware buffered > > writeback. That is, having a flusher thread do work on behalf of a > > memcg in the IO context of the memcg. Then the IO controller just > > sees a stream of async writes in the context of the memcg the > > buffered writes came from in the first place. The block layer > > throttles them just like any other IO in the IO context of the > > memcg... > > Yes that is still a piece remaining. 
I was hoping that Greg Thelen will > be able to extend his patches to submit writes in the context of > per-cgroup flusher/worker threads and solve this problem. > > Thanks > Vivek Are you suggesting multiple flushers per bdi (one per cgroup)? I thought the point of IO-less was to issue buffered writes from a single thread. Note: I have rebased the memcg writeback code to the latest mmotm and am testing it now. These patches do not introduce additional threads; the existing bdi flusher threads are used with an optional memcg filter. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 301+ messages in thread
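The rate matching Dave describes above — buffered writes enter the page cache only as fast as the IO path drains them out — is a feedback control loop. A toy user-space sketch can illustrate the idea (the 1/8 gain, the function names, and the numbers are all invented here for illustration; this is not the patchset's actual estimator):

```c
/*
 * Toy model of the adaptation described above: each estimation period,
 * nudge the dirty ratelimit toward the observed cleaning (writeback)
 * rate, so the rate at which pages enter the page cache converges to
 * the rate at which they drain out.  The 1/8 step is an arbitrary
 * illustrative gain.
 */
static double throttle_step(double dirty_ratelimit, double cleaning_rate)
{
	return dirty_ratelimit + (cleaning_rate - dirty_ratelimit) / 8.0;
}

double converge(double dirty_ratelimit, double cleaning_rate, int periods)
{
	int i;

	for (i = 0; i < periods; i++)
		dirty_ratelimit = throttle_step(dirty_ratelimit, cleaning_rate);
	return dirty_ratelimit;
}
```

If a read-heavy workload or a block-layer throttle cuts the cleaning rate, feeding the lower rate into the same loop throttles writers harder — with no explicit coupling to the block layer, which is exactly the decoupling argued for above.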
* Re: [PATCH 0/5] IO-less dirty throttling v8 2011-08-10 7:41 ` Greg Thelen (?) @ 2011-08-10 18:40 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-10 18:40 UTC (permalink / raw) To: Greg Thelen Cc: Minchan Kim, Wu Fengguang, Dave Chinner, Christoph Hellwig, LKML, Andrea Righi, Andrew Morton, linux-fsdevel, linux-mm, Jan Kara, KAMEZAWA Hiroyuki On Wed, Aug 10, 2011 at 12:41:00AM -0700, Greg Thelen wrote: [..] > > > However, before we have a "finished product", there is still another > > > piece of the puzzle to be put in place - memcg-aware buffered > > > writeback. That is, having a flusher thread do work on behalf of > > > memcg in the IO context of the memcg. Then the IO controller just > > > sees a stream of async writes in the context of the memcg the > > > buffered writes came from in the first place. The block layer > > > throttles them just like any other IO in the IO context of the > > > memcg... > > > > Yes that is still a piece remaining. I was hoping that Greg Thelen will > > be able to extend his patches to submit writes in the context of > > per-cgroup flusher/worker threads and solve this problem. > > > > Thanks > > Vivek > > Are you suggesting multiple flushers per bdi (one per cgroup)? I > thought the point of IO-less was to issue buffered writes from a > single thread. I think in one of the mail threads Dave Chinner mentioned this idea of using a per-cgroup worker/workqueue. Agreed that it leads back to the issue of multiple writers (but only if multiple cgroups are there). But at the same time it simplifies at least two problems.

- The worker could be migrated to the cgroup we are writing for and we don't need the IO tracking logic. The blkio controller will automatically account the IO to the right group.

- We don't have to worry about a single flusher thread sleeping on the request queue because either the queue or the group is congested, which can lead to other groups' IO not being submitted. 
Thanks Vivek ^ permalink raw reply [flat|nested] 301+ messages in thread
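Vivek's second point — a single flusher sleeping on one congested queue or group holds up every other group's IO — can be sketched with per-group work lists (a toy model; the data layout, `NGROUPS`, and `flush_round` are invented here for illustration, not the proposed kernel implementation):

```c
#define NGROUPS 3

/*
 * Toy model of per-cgroup flushing: each group has its own pending
 * writeback work and, conceptually, its own worker.  A congested
 * (blocked) group stalls only itself; with a single shared flusher
 * thread, work queued behind the congested group would stall too.
 */
int flush_round(int pending[NGROUPS], const int blocked[NGROUPS], int batch)
{
	int g, n, submitted = 0;

	for (g = 0; g < NGROUPS; g++) {
		if (blocked[g])
			continue;	/* this group's queue is congested */
		n = pending[g] < batch ? pending[g] : batch;
		pending[g] -= n;
		submitted += n;
	}
	return submitted;
}
```

With pending work {100, 100, 100} pages and group 1 blocked, a round with batch 10 still submits 20 pages for groups 0 and 2 — the trade-off being the return to multiple writers per bdi that Greg questions above.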
* Re: [PATCH 0/5] IO-less dirty throttling v8 2011-08-09 2:01 ` Vivek Goyal @ 2011-08-11 3:21 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-11 3:21 UTC (permalink / raw) To: Vivek Goyal Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML > [...] it only deals with controlling buffered write IO and nothing > else. So on the same block device, other direct writes might be > going on from the same group and in this scheme a user will not have any > control. The IO-less balance_dirty_pages() will be able to throttle DIRECT writes. There is nothing fundamental in the way. The basic approach will be to add a balance_dirty_pages_ratelimited_nr() call in the DIRECT write path, and to call into balance_dirty_pages() regardless of the various dirty thresholds. Then the IO-less balance_dirty_pages() has all the facilities to throttle a task at any auto-estimated or user-specified ratelimit. > Another disadvantage is that throttling at page cache level does not > take care of IO spikes at device level. Yes, this is a problem. But it's a problem best fixable in the IO scheduler.. (I cannot go into details at this time, however it does _sound_ possible to me..) > How do you implement proportional control here? From the overall bdi bandwidth, > vary per-cgroup bandwidth regularly based on cgroup weight? Again the > issue here is that it controls only buffered WRITES and nothing else, and > in this case co-ordinating with CFQ will probably be hard. So I guess > proportional IO just for buffered WRITES will be of limited > use. "priority" may be a more suitable phrase. It will be implemented like this (without the user interface):

@@ -1007,6 +1001,13 @@ static void balance_dirty_pages(struct a
 		max_pause = bdi_max_pause(bdi, bdi_dirty);

 		base_rate = bdi->dirty_ratelimit;
+		/*
+		 * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
+		 * real-time tasks.
+		 */
+		if (current->flags & PF_LESS_THROTTLE || rt_task(current))
+			base_rate *= 2;
+
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,
 					       bdi_thresh, bdi_dirty);

That is, if we start 2 dd tasks A and B with priority_B=2, then the resulting rate_B will be equal to 2*rate_A. The ->dirty_ratelimit will auto-adapt to rate_A, or equally (write_bw/3). The same can be applied to cgroups. One may specify that the whole cgroup's dirty rate be throttled at N times that of a normal dd in the root cgroup, or be throttled at some absolute 10MB/s rate. The corresponding cgroup->dirty_ratelimit will be set to (N * bdi->dirty_ratelimit) for the former and 10MB/s for the latter. The user can specify any combination of "priority" and "absolute ratelimit" for any task and/or cgroup, tasks inside a cgroup, and so on. We have a very powerful (bdi or cgroup)->dirty_ratelimit adaptation mechanism to support the combinations :) The "priority" can even be applied to DIRECT dirtiers, _as long as_ there are other buffered dirtiers to generate enough dirty pages. It's not as easy to apply priorities when there are only DIRECT dirtiers. In contrast, the absolute ratelimit is always applicable to all kinds of tasks and cgroups. Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
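The 2-dd arithmetic in Wu's example can be checked with a tiny model of the described adaptation (illustrative only; `adapted_base_rate` is a name invented here, and this simplification is not the kernel's actual estimator):

```c
/*
 * Simplified model of the priority scheme described above:
 * ->dirty_ratelimit adapts until sum(prio_i * base) == write_bw,
 * i.e. base = write_bw / sum(prio).  Task i then dirties pages
 * at prio_i * base.
 */
double adapted_base_rate(double write_bw, const int *prio, int n)
{
	int i, sum = 0;

	for (i = 0; i < n; i++)
		sum += prio[i];
	return write_bw / sum;
}
```

For write_bw = 120 and priorities {1, 2}, the base settles at 40, so rate_A = 40 and rate_B = 80 = 2*rate_A — matching the write_bw/3 figure in the mail.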
* Re: [PATCH 0/5] IO-less dirty throttling v8
  2011-08-11  3:21 ` Wu Fengguang
@ 2011-08-11 20:42   ` Vivek Goyal
  -1 siblings, 0 replies; 301+ messages in thread
From: Vivek Goyal @ 2011-08-11 20:42 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
    Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 11:21:43AM +0800, Wu Fengguang wrote:
> > [...] it only deals with controlling buffered write IO and nothing
> > else. So on the same block device, other direct writes might be
> > going on from same group and in this scheme a user will not have any
> > control.
>
> The IO-less balance_dirty_pages() will be able to throttle DIRECT
> writes. There is nothing fundamental in the way.
>
> The basic approach will be to add a balance_dirty_pages_ratelimited_nr()
> call in the DIRECT write path, and to call into balance_dirty_pages()
> regardless of the various dirty thresholds.
>
> Then the IO-less balance_dirty_pages() has all the facilities to
> throttle a task at any auto-estimated or user-specified ratelimit.

A direct IO being routed through balance_dirty_pages() when it is really
not dirtying anything sounds really odd to me. What about direct AIO?

Throttling direct IO at balance_dirty_pages() is a little different from
throttling at device level, where we build a buffer of requests and
submit requests asynchronously (even when the submitter has crossed the
threshold/rate). The submitter does not have to block and can go back to
user space and do other things while waiting for completion of the
submitted IO.

You know what, since the beginning you have been talking about how this
mechanism can be extended to do some IO control. That's fine. I think a
more fruitful discussion can happen if we approach the problem in a
different way: let's figure out what the requirements are, what the
problems are, what we need to control, what the best place to control
something is, and how the interface is going to look.

Once we figure out the interfaces and what we are trying to achieve,
the rest of it is just mechanism. Your method is one possible way of
implementing things, and then we can discuss the advantages and
disadvantages of the various mechanisms.

What do we want
---------------

To me the basic problem is as follows. We primarily want to provide two
controls, at least at cgroup level. If the same can be extended to task
level, that's a bonus.

- Notion of io priority (work conserving control, proportional IO)
- Absolute limits (non work conserving control, throttling)

What do we currently have
-------------------------

- Proportional IO is implemented at device level in the CFQ IO scheduler.

- It works both at task level (ioprio) and group level (blkio.weight).
  The only problem is that it works only for synchronous IO and does not
  cover buffered WRITES.

- Throttling

  - Implemented at block layer (per device). Works for groups. There is
    no per task interface. Again works for synchronous IO and does not
    cover buffered writes.

So to me in the current scheme of things there is only one big problem
to be solved.

- How to control buffered writes.
  - proportional IO
  - absolute throttling

Proportional IO
---------------

- Because we lose all the context information of the submitter by the
  time IO reaches CFQ, for task ioprio it is probably best to do
  something about it when writing to the bdi. So your scheme sounds like
  a good candidate for that.

- At cgroup level, things get a little more complicated, as priority
  belongs to the whole group, and a group could be doing some READs,
  some direct WRITEs and some buffered WRITEs. If we implement a group's
  proportional write control at page cache level, we have the following
  issue.

  - bdi based control does not know about READs and direct WRITEs. Now
    assume that a high prio group is doing just buffered writes and a
    low prio group is doing READs. CFQ will choke WRITEs behind READs
    and effectively the higher prio group did not get its share.

So I think doing proportional IO control at device level provides better
control overall and better integration with cgroups.

Throttling
----------

- Throttling of buffered WRITEs can be done at page cache level and it
  makes sense to me in general. There seem to be two primary issues we
  need to think about.

  - It should gel well with the current IO controller interfaces. Either
    we provide a separate control file in the blkio controller which
    only controls the buffered write rate, or we come up with a way so
    that a common control knows about both direct and buffered writes
    and control can come out of a common quota. For example, if somebody
    says that 10MB/s is the write limit for this cgroup on device 8:32,
    then that limit is effective both for direct writes as well as
    buffered writes.

    Alternatively, we could implement a separate control file, say
    blkio.throttle.buffered_write_bps_device, where one specifies the
    buffered write rate of a cgroup on a device, and your logic parses
    it and controls it. The direct IO limit then comes from the separate
    existing file blkio.throttle.write_bps_device. In my opinion that is
    a less integrated approach and users will find it less friendly to
    configure.

  - IO spikes at the device when the flusher cleans up dirty memory. I
    know you have been saying that the IO schedulers somehow should take
    care of it, but IO schedulers provide only so much protection
    against WRITEs. On top of that, the protection is not predictable.
    CFQ still provides good protection against WRITEs, but what about
    deadline and noop? There, spikes for sure will lead to less
    predictable IO latencies for READs.

If we implement throttling for buffered writes at device level and a
feedback mechanism reduces the dirty rate for the cgroup automatically,
that will take care of both the above issues. The only issue we will
have to worry about is how to take care of priority inversion, where a
high prio IO gets throttled behind low prio IO. For that, file systems
will have to be more parallel. Throttling at page cache level has the
advantage that it has to worry less about this serialization.

So I see the following immediate extensions of your scheme possible.

- Inherit ioprio from the iocontext and provide buffered write service
  differentiation for writers.

- Create a per task buffered write throttling interface and do absolute
  throttling of the task.

- We can possibly implement the idea of a group wide "buffered writes
  only" throttling control at this layer using this mechanism.

Thoughts?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 301+ messages in thread
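For concreteness, the two interface shapes contrasted above could look
like this from a shell (a sketch only: blkio.throttle.write_bps_device
is the existing cgroup v1 knob, while
blkio.throttle.buffered_write_bps_device is a hypothetical file that
exists only in this discussion; the 8:32 device and cgroup path are
illustrative):

```shell
# One common quota (10MB/s on device 8:32) covering direct + buffered writes:
echo "8:32 10485760" > /sys/fs/cgroup/blkio/grp1/blkio.throttle.write_bps_device

# ...versus the less integrated two-knob approach: separate limits for
# direct writes (existing knob) and buffered writes (hypothetical knob):
echo "8:32 10485760" > /sys/fs/cgroup/blkio/grp1/blkio.throttle.write_bps_device
echo "8:32 5242880"  > /sys/fs/cgroup/blkio/grp1/blkio.throttle.buffered_write_bps_device
```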
* Re: [PATCH 0/5] IO-less dirty throttling v8
  2011-08-11 20:42     ` Vivek Goyal
@ 2011-08-11 21:00       ` Vivek Goyal
  -1 siblings, 0 replies; 301+ messages in thread
From: Vivek Goyal @ 2011-08-11 21:00 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
    Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 04:42:55PM -0400, Vivek Goyal wrote:

[..]
> So I see following immediate extension of your scheme possible.
>
> - Inherit ioprio from iocontext and provide buffered write service
>   differentiation for writers.
>
> - Create a per task buffered write throttling interface and do
>   absolute throttling of task.
>
> - We can possibly do the idea of throttling group wide buffered
>   writes only control at this layer using this mechanism.

Though personally I like the idea of absolute throttling at page cache
level, as it can help a bit with the problem of buffered WRITES
impacting the latency of everything else in the system. CFQ helps a
lot, but it idles enough that the cost of this isolation is very high
on faster storage. Deadline and noop really do not do much about
protection from WRITEs.

So it is not perfect, but it might prove to be good enough for some use
cases.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 301+ messages in thread
* [PATCH 0/5] IO-less dirty throttling v9
@ 2011-08-16  2:20 Wu Fengguang
  2011-08-16  2:20 ` Wu Fengguang
  0 siblings, 1 reply; 301+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
To: linux-fsdevel
Cc: Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig,
    Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi,
    linux-mm, LKML, Wu Fengguang

Hi,

The core bits of the IO-less balance_dirty_pages().

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v9

Changes since v8:

- a lot of renames and comment/changelog rework
- use 3rd order polynomial as the global control line (Peter)
- stabilize dirty_ratelimit by decreasing the update step size on small
  errors
- limit per-CPU dirtied pages to avoid dirty pages running away on 1k+
  tasks (Peter)

Thanks a lot to Peter, Andrea and Vivek for the careful reviews!

shortlog:

	Wu Fengguang (5):
	      writeback: account per-bdi accumulated dirtied pages
	      writeback: dirty position control
	      writeback: dirty rate control
	      writeback: per task dirty rate limit
	      writeback: IO-less balance_dirty_pages()

The last 4 patches are one single logical change, but split here to make
it easier to review the different parts of the algorithm.

diffstat:

	 fs/fs-writeback.c                |    2
	 include/linux/backing-dev.h      |    8
	 include/linux/sched.h            |    7
	 include/linux/writeback.h        |    1
	 include/trace/events/writeback.h |   24 -
	 kernel/fork.c                    |    3
	 mm/backing-dev.c                 |    3
	 mm/page-writeback.c              |  544 ++++++++++++++++++++---------
	 8 files changed, 414 insertions(+), 178 deletions(-)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
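The "3rd order polynomial as the global control line" item above can be
illustrated with a floating-point model (a sketch under assumptions: the
kernel computes this in fixed point inside bdi_position_ratio() and
combines it with additional per-bdi control lines; the freerun, limit
and setpoint names follow the series' terminology, the page counts are
made up):

```python
def global_pos_ratio(dirty, freerun, limit):
    """Cubic global control line (floating-point sketch).

    The setpoint sits midway between the freerun ceiling and the hard
    limit. pos_ratio decreases smoothly from 2.0 at freerun through 1.0
    at the setpoint to 0.0 at the limit, scaling each task's dirty rate
    up below the setpoint and down above it.
    """
    setpoint = (freerun + limit) / 2
    x = (setpoint - dirty) / (limit - setpoint)
    return 1.0 + x ** 3

# With freerun=150 and limit=200 (pages, illustrative), setpoint=175:
assert global_pos_ratio(175, 150, 200) == 1.0  # balanced: no correction
assert global_pos_ratio(200, 150, 200) == 0.0  # at the limit: full stop
assert global_pos_ratio(150, 150, 200) == 2.0  # at freerun: double rate
```

The cubic shape is what makes the feedback gentle near the setpoint
(small slope) yet firm near the limit, which is why it replaced the
earlier piecewise control line.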
* [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-16  2:20 [PATCH 0/5] IO-less dirty throttling v9 Wu Fengguang
  2011-08-16  2:20 ` Wu Fengguang
@ 2011-08-16  2:20 ` Wu Fengguang
  0 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
To: linux-fsdevel
Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
    Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
    Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 15084 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it sleep for some time
to throttle the dirtying task. In the meantime, kick off the per-bdi
flusher thread to do background writeback IO.

RATIONALE
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled starts foreground
  writeback, we get N IO submitters from at least N different inodes at
  the same time, and end up with N different sets of IO being issued
  with potentially zero locality to each other, resulting in much lower
  elevator sort/merge efficiency; hence we seek the disk all over the
  place to service the different sets of IO.

  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave
  Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0
  goes from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock
  contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30%
    to 3%, and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by the current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency,
  because it could lead to more than 1 second of user perceivable
  stalls. Even the current 4MB write size may be too large for slow USB
  sticks. The fact that balance_dirty_pages() starts IO on itself
  couples the IO size to the wait time, which makes it hard to use a
  suitable IO size while keeping the wait time under control.

  Now it's possible to increase the writeback chunk size proportionally
  to the disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB
  ram, the larger writeback size dramatically reduces the seek count to
  1/10 (far beyond my expectation) and improves the write throughput by
  24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have had the experience: it often takes a couple of
  seconds or even longer to stop a heavily writing dd/cp/tar command
  with Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There is a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress at a constant
  rate. The current threshold based throttling inherently transfers the
  large low level IO completion fluctuations to bumpy application
  write()s, and further deteriorates with increasing numbers of dirtiers
  and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpily in the legacy kernel, and throughput
  is improved by 67% by this patchset. (Plus the larger write chunk
  size, it will be a 93% speedup.)

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and I explored a scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait times and jitters.

- NFS may kill a large amount of unstable pages with one single COMMIT.
  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely to optimize away such bursty IO completions, and the resulting
  large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocation, by controlling the
number of pages dirtied before calling balance_dirty_pages(), for smooth
and efficient dirty throttling:

- avoid useless (eg.
zero pause time) balance_dirty_pages() calls - avoid too small pause time (less than 4ms, which burns CPU power) - avoid too large pause time (more than 200ms, which hurts responsiveness) - avoid big fluctuations of pause times It can control pause times at will. The default policy (in a followup patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms in 1000-dd case. BEHAVIOR CHANGE =============== (1) dirty threshold Users will notice that the applications will get throttled once crossing the global (background + dirty)/2=15% threshold, and then balanced around 17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable memory in 1-dd case. Since the task will be soft throttled earlier than before, it may be perceived by end users as performance "slow down" if his application happens to dirty more than 15% dirtyable memory. (2) smoothness/responsiveness Users will notice a more responsive system during heavy writeback. "killall dd" will take effect instantly. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- include/trace/events/writeback.h | 24 ---- mm/page-writeback.c | 147 ++++++++--------------------- 2 files changed, 41 insertions(+), 130 deletions(-) --- linux-next.orig/mm/page-writeback.c 2011-08-15 14:09:01.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-08-16 08:50:46.000000000 +0800 @@ -250,50 +250,6 @@ static void bdi_writeout_fraction(struct numerator, denominator); } -static inline void task_dirties_fraction(struct task_struct *tsk, - long *numerator, long *denominator) -{ - prop_fraction_single(&vm_dirties, &tsk->dirties, - numerator, denominator); -} - -/* - * task_dirty_limit - scale down dirty throttling threshold for one task - * - * task specific dirty limit: - * - * dirty -= (dirty/8) * p_{t} - * - * To protect light/slow dirtying tasks from heavier/fast ones, we start - * throttling individual tasks before reaching the bdi dirty limit. 
- * Relatively low thresholds will be allocated to heavy dirtiers. So when - * dirty pages grow large, heavy dirtiers will be throttled first, which will - * effectively curb the growth of dirty pages. Light dirtiers with high enough - * dirty threshold may never get throttled. - */ -#define TASK_LIMIT_FRACTION 8 -static unsigned long task_dirty_limit(struct task_struct *tsk, - unsigned long bdi_dirty) -{ - long numerator, denominator; - unsigned long dirty = bdi_dirty; - u64 inv = dirty / TASK_LIMIT_FRACTION; - - task_dirties_fraction(tsk, &numerator, &denominator); - inv *= numerator; - do_div(inv, denominator); - - dirty -= inv; - - return max(dirty, bdi_dirty/2); -} - -/* Minimum limit for any task */ -static unsigned long task_min_dirty_limit(unsigned long bdi_dirty) -{ - return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION; -} - /* * */ @@ -939,29 +895,34 @@ static unsigned long dirty_poll_interval /* * balance_dirty_pages() must be called by processes which are generating dirty * data. It looks at the number of dirty pages in the machine and will force - * the caller to perform writeback if the system is over `vm_dirty_ratio'. + * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2. * If we're over `background_thresh' then the writeback threads are woken to * perform some writeout. 
*/ static void balance_dirty_pages(struct address_space *mapping, - unsigned long write_chunk) + unsigned long pages_dirtied) { - unsigned long nr_reclaimable, bdi_nr_reclaimable; + unsigned long nr_reclaimable; unsigned long nr_dirty; /* = file_dirty + writeback + unstable_nfs */ unsigned long bdi_dirty; unsigned long background_thresh; unsigned long dirty_thresh; unsigned long bdi_thresh; - unsigned long task_bdi_thresh; - unsigned long min_task_bdi_thresh; - unsigned long pages_written = 0; - unsigned long pause = 1; + long pause = 0; bool dirty_exceeded = false; - bool clear_dirty_exceeded = true; + unsigned long task_ratelimit; + unsigned long base_rate; + unsigned long pos_ratio; struct backing_dev_info *bdi = mapping->backing_dev_info; unsigned long start_time = jiffies; for (;;) { + /* + * Unstable writes are a feature of certain networked + * filesystems (i.e. NFS) in which data may have been + * written to the server's write cache, but has not yet + * been flushed to permanent storage. + */ nr_reclaimable = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS); nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); @@ -978,8 +939,6 @@ static void balance_dirty_pages(struct a break; bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); - min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh); - task_bdi_thresh = task_dirty_limit(current, bdi_thresh); /* * In order to avoid the stacked BDI deadlock we need @@ -991,57 +950,41 @@ static void balance_dirty_pages(struct a * actually dirty; with m+n sitting in the percpu * deltas. 
*/ - if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) { - bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE); - bdi_dirty = bdi_nr_reclaimable + + if (bdi_thresh < 2 * bdi_stat_error(bdi)) + bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) + bdi_stat_sum(bdi, BDI_WRITEBACK); - } else { - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); - bdi_dirty = bdi_nr_reclaimable + + else + bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) + bdi_stat(bdi, BDI_WRITEBACK); - } - /* - * The bdi thresh is somehow "soft" limit derived from the - * global "hard" limit. The former helps to prevent heavy IO - * bdi or process from holding back light ones; The latter is - * the last resort safeguard. - */ - dirty_exceeded = (bdi_dirty > task_bdi_thresh) || + dirty_exceeded = (bdi_dirty > bdi_thresh) || (nr_dirty > dirty_thresh); - clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) && - (nr_dirty <= dirty_thresh); - - if (!dirty_exceeded) - break; - - if (!bdi->dirty_exceeded) + if (dirty_exceeded && !bdi->dirty_exceeded) bdi->dirty_exceeded = 1; bdi_update_bandwidth(bdi, dirty_thresh, background_thresh, nr_dirty, bdi_thresh, bdi_dirty, start_time); - /* Note: nr_reclaimable denotes nr_dirty + nr_unstable. - * Unstable writes are a feature of certain networked - * filesystems (i.e. NFS) in which data may have been - * written to the server's write cache, but has not yet - * been flushed to permanent storage. - * Only move pages to writeback if this bdi is over its - * threshold otherwise wait until the disk writes catch - * up. 
- */ - trace_balance_dirty_start(bdi); - if (bdi_nr_reclaimable > task_bdi_thresh) { - pages_written += writeback_inodes_wb(&bdi->wb, - write_chunk); - trace_balance_dirty_written(bdi, pages_written); - if (pages_written >= write_chunk) - break; /* We've done our duty */ + if (unlikely(!writeback_in_progress(bdi))) + bdi_start_background_writeback(bdi); + + base_rate = bdi->dirty_ratelimit; + pos_ratio = bdi_position_ratio(bdi, dirty_thresh, + background_thresh, nr_dirty, + bdi_thresh, bdi_dirty); + if (unlikely(pos_ratio == 0)) { + pause = MAX_PAUSE; + goto pause; } + task_ratelimit = (u64)base_rate * + pos_ratio >> RATELIMIT_CALC_SHIFT; + pause = (HZ * pages_dirtied) / (task_ratelimit | 1); + pause = min(pause, MAX_PAUSE); + +pause: __set_current_state(TASK_UNINTERRUPTIBLE); io_schedule_timeout(pause); - trace_balance_dirty_wait(bdi); dirty_thresh = hard_dirty_limit(dirty_thresh); /* @@ -1051,8 +994,7 @@ static void balance_dirty_pages(struct a * (b) the pause time limit makes the dirtiers more responsive. */ if (nr_dirty < dirty_thresh + - dirty_thresh / DIRTY_MAXPAUSE_AREA && - time_after(jiffies, start_time + MAX_PAUSE)) + dirty_thresh / DIRTY_MAXPAUSE_AREA) break; /* * pass-good area. When some bdi gets blocked (eg. NFS server @@ -1065,18 +1007,9 @@ static void balance_dirty_pages(struct a dirty_thresh / DIRTY_PASSGOOD_AREA && bdi_dirty < bdi_thresh) break; - - /* - * Increase the delay for each loop, up to our previous - * default of taking a 100ms nap. - */ - pause <<= 1; - if (pause > HZ / 10) - pause = HZ / 10; } - /* Clear dirty_exceeded flag only when no task can exceed the limit */ - if (clear_dirty_exceeded && bdi->dirty_exceeded) + if (!dirty_exceeded && bdi->dirty_exceeded) bdi->dirty_exceeded = 0; current->nr_dirtied = 0; @@ -1093,8 +1026,10 @@ static void balance_dirty_pages(struct a * In normal mode, we start background writeout at the lower * background_thresh, to keep the amount of dirty memory low. 
*/ - if ((laptop_mode && pages_written) || - (!laptop_mode && (nr_reclaimable > background_thresh))) + if (laptop_mode) + return; + + if (nr_reclaimable > background_thresh) bdi_start_background_writeback(bdi); } --- linux-next.orig/include/trace/events/writeback.h 2011-08-15 13:59:09.000000000 +0800 +++ linux-next/include/trace/events/writeback.h 2011-08-16 08:50:46.000000000 +0800 @@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister); DEFINE_WRITEBACK_EVENT(writeback_thread_start); DEFINE_WRITEBACK_EVENT(writeback_thread_stop); -DEFINE_WRITEBACK_EVENT(balance_dirty_start); -DEFINE_WRITEBACK_EVENT(balance_dirty_wait); - -TRACE_EVENT(balance_dirty_written, - - TP_PROTO(struct backing_dev_info *bdi, int written), - - TP_ARGS(bdi, written), - - TP_STRUCT__entry( - __array(char, name, 32) - __field(int, written) - ), - - TP_fast_assign( - strncpy(__entry->name, dev_name(bdi->dev), 32); - __entry->written = written; - ), - - TP_printk("bdi %s written %d", - __entry->name, - __entry->written - ) -); DECLARE_EVENT_CLASS(wbc_class, TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), ^ permalink raw reply [flat|nested] 301+ messages in thread
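The new pause computation in the patch, pause = (HZ * pages_dirtied) /
(task_ratelimit | 1) clamped to MAX_PAUSE, can be modeled in user space
(an illustrative sketch; the HZ value and the ~200ms cap are assumed
numbers, not taken from a specific config):

```python
HZ = 1000              # jiffies per second; an assumed config value
MAX_PAUSE = HZ // 5    # models the ~200ms pause cap from the changelog

def pause_jiffies(pages_dirtied, task_ratelimit):
    """Sleep long enough that the effective dirtying rate matches
    task_ratelimit (pages/s), mirroring the patch's pause computation."""
    pause = (HZ * pages_dirtied) // (task_ratelimit | 1)  # | 1 avoids /0
    return min(pause, MAX_PAUSE)

# 100 pages dirtied at a ~10000 pages/s ratelimit: sleep ~10ms
assert pause_jiffies(100, 9999) == 10
# a zero ratelimit degenerates to the maximum pause, not a div-by-zero
assert pause_jiffies(100, 0) == MAX_PAUSE
```

Because pages_dirtied is itself tuned between calls, the kernel can keep
each pause inside the desired window (several ms up to the cap) at any
number of dirtier tasks.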
* [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-16  2:20 ` Wu Fengguang
  0 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 15387 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it sleep for some time
to throttle the dirtying task. Meanwhile, kick off the per-bdi flusher
thread to do the background writeback IO.

RATIONALE
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread that is writing and being throttled starts foreground
  writeback, we get N IO submitters working on at least N different
  inodes at the same time, and end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in much
  lower elevator sort/merge efficiency; hence we seek the disk all over
  the place to service the different sets of IO.

  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IO
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0
  goes from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock
  contention".
* "CPU usage has dropped by ~55%", "it certainly appears that most of the CPU time saving comes from the removal of contention on the inode_wb_list_lock" (IMHO at least 10% comes from the reduction of cacheline bouncing, because the new code is able to call much less frequently into balance_dirty_pages() and hence access the global page states) * the user space "App overhead" is reduced by 20%, by avoiding the cacheline pollution by the complex writeback code path * "for a ~5% throughput reduction", "the number of write IOs have dropped by ~25%", and the elapsed time reduced from 41:42.17 to 40:53.23. * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and improves IO throughput from 38MB/s to 42MB/s. - IO size too small for fast arrays and too large for slow USB sticks The write_chunk used by current balance_dirty_pages() cannot be directly set to some large value (eg. 128MB) for better IO efficiency. Because it could lead to more than 1 second user perceivable stalls. Even the current 4MB write size may be too large for slow USB sticks. The fact that balance_dirty_pages() starts IO on itself couples the IO size to wait time, which makes it hard to do suitable IO size while keeping the wait time under control. Now it's possible to increase writeback chunk size proportional to the disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram, the larger writeback size dramatically reduces the seek count to 1/10 (far beyond my expectation) and improves the write throughput by 24%. - long block time in balance_dirty_pages() hurts desktop responsiveness Many of us may have the experience: it often takes a couple of seconds or even long time to stop a heavy writing dd/cp/tar command with Ctrl-C or "kill -9". 
- IO pipeline broken by bumpy write() progress

  There is a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized, or will even come to a
  stop, if the write()s have long latencies _or_ don't progress at a
  constant rate. The current threshold based throttling inherently
  transfers the large low level IO completion fluctuations into bumpy
  application write()s, and deteriorates further with an increasing
  number of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpily in the legacy kernel, and its
  throughput is improved by 67% by this patchset. (Together with the
  larger write chunk size, it becomes a 93% speedup.)

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and I explored the scheme of letting
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However, it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait times and jitters.

- NFS may kill a large amount of unstable pages with one single COMMIT.
  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely that such bursty IO completions can be optimized away, nor the
  resulting large (and tiny) stall times of IO completion based
  throttling.

So here is a pause time oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation by controlling the
number of pages dirtied before calling balance_dirty_pages(), for smooth
and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause times (less than 4ms, which burns CPU power)
- avoid too large pause times (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in the 1-dd case, increasing to ~100ms
in the 1000-dd case.

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that applications get throttled once they cross the
global (background + dirty)/2 = 15% threshold, and are then balanced
around 17.5%. Before this patch, the behavior was to throttle at the 20%
dirtyable memory limit in the 1-dd case.

Since tasks will be soft throttled earlier than before, end users may
perceive a performance "slow down" if their applications happen to dirty
more than 15% of dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  147 ++++++++---------------------
 2 files changed, 41 insertions(+), 130 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-15 14:09:01.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 08:50:46.000000000 +0800
@@ -250,50 +250,6 @@ static void bdi_writeout_fraction(struct
 		numerator, denominator);
 }

-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -939,29 +895,34 @@ static unsigned long dirty_poll_interval
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data. It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
 */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long task_ratelimit;
+	unsigned long base_rate;
+	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;

 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -978,8 +939,6 @@ static void balance_dirty_pages(struct a
 			break;

 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);

 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -991,57 +950,41 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi))
+			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		else
+			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat(bdi, BDI_WRITEBACK);
-		}

-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;

 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);

-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		base_rate = bdi->dirty_ratelimit;
+		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
+					       background_thresh, nr_dirty,
+					       bdi_thresh, bdi_dirty);
+		if (unlikely(pos_ratio == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		task_ratelimit = (u64)base_rate *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		pause = min(pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);

 		dirty_thresh = hard_dirty_limit(dirty_thresh);

 		/*
@@ -1051,8 +994,7 @@ static void balance_dirty_pages(struct a
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
 		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+			       dirty_thresh / DIRTY_MAXPAUSE_AREA)
 			break;
 		/*
 		 * pass-good area. When some bdi gets blocked (eg. NFS server
@@ -1065,18 +1007,9 @@ static void balance_dirty_pages(struct a
 		    dirty_thresh / DIRTY_PASSGOOD_AREA &&
 		    bdi_dirty < bdi_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}

-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;

 	current->nr_dirtied = 0;
@@ -1093,8 +1026,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }

--- linux-next.orig/include/trace/events/writeback.h	2011-08-15 13:59:09.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-16 08:50:46.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);

 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-16  2:20 ` Wu Fengguang
@ 2011-08-19  2:06 ` Vivek Goyal
  -1 siblings, 0 replies; 301+ messages in thread
From: Vivek Goyal @ 2011-08-19  2:06 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:11AM +0800, Wu Fengguang wrote:

[..]
> +		if (dirty_exceeded && !bdi->dirty_exceeded)
> 			bdi->dirty_exceeded = 1;
>
> 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> 				     nr_dirty, bdi_thresh, bdi_dirty,
> 				     start_time);
>
> -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 * Only move pages to writeback if this bdi is over its
> -		 * threshold otherwise wait until the disk writes catch
> -		 * up.
> -		 */
> -		trace_balance_dirty_start(bdi);
> -		if (bdi_nr_reclaimable > task_bdi_thresh) {
> -			pages_written += writeback_inodes_wb(&bdi->wb,
> -							     write_chunk);
> -			trace_balance_dirty_written(bdi, pages_written);
> -			if (pages_written >= write_chunk)
> -				break;		/* We've done our duty */
> +		if (unlikely(!writeback_in_progress(bdi)))
> +			bdi_start_background_writeback(bdi);
> +
> +		base_rate = bdi->dirty_ratelimit;
> +		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> +					       background_thresh, nr_dirty,
> +					       bdi_thresh, bdi_dirty);
> +		if (unlikely(pos_ratio == 0)) {
> +			pause = MAX_PAUSE;
> +			goto pause;
> 		}
> +		task_ratelimit = (u64)base_rate *
> +					pos_ratio >> RATELIMIT_CALC_SHIFT;

Hi Fengguang,

I am a little confused here. I see that you have already taken pos_ratio
into account in bdi_update_dirty_ratelimit(), and I am wondering why it
is taken into account again in balance_dirty_pages().
We calculated the pos_rate and balance_rate and adjusted
bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit(). So why
are we adjusting this pos_ratio()-adjusted limit with pos_ratio() again?
Doesn't it effectively become the following (assuming one is decreasing
the dirty rate limit)?

	base_rate = bdi->dirty_ratelimit
	pos_rate = base_rate * pos_ratio();

	                          write_bw
	balance_rate = pos_rate * --------
	                          dirty_bw

	delta = max(pos_rate, balance_rate)
	bdi->dirty_ratelimit = bdi->dirty_ratelimit - delta;

	task_ratelimit = bdi->dirty_ratelimit * pos_ratio();

So we have already taken pos_ratio() into account while calculating the
new bdi->dirty_ratelimit. Do we need to take it into account again?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-19  2:06 ` Vivek Goyal
@ 2011-08-19  2:54 ` Wu Fengguang
  -1 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-19  2:54 UTC (permalink / raw)
To: Vivek Goyal
Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

Hi Vivek,

> > +	base_rate = bdi->dirty_ratelimit;
> > +	pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > +				       background_thresh, nr_dirty,
> > +				       bdi_thresh, bdi_dirty);
> > +	if (unlikely(pos_ratio == 0)) {
> > +		pause = MAX_PAUSE;
> > +		goto pause;
> > 	}
> > +	task_ratelimit = (u64)base_rate *
> > +			 pos_ratio >> RATELIMIT_CALC_SHIFT;
> 
> Hi Fengguang,
> 
> I am a little confused here. I see that you have already taken pos_ratio
> into account in bdi_update_dirty_ratelimit() and am wondering why we take
> that into account again in balance_dirty_pages().
> 
> We calculated the pos_rate and balance_rate and adjusted
> bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().

Good question. There are some inter-dependencies in the calculation,
and the dependency chain is the opposite of the one in your mind:
balance_dirty_pages() used pos_ratio in the first place, so
bdi_update_dirty_ratelimit() has to use pos_ratio in the calculation
of the balanced dirty rate, too.

Let's return to how the balanced dirty rate is estimated. Please pay
special attention to the last paragraphs below the "......" line.
Start by throttling each dd task at rate

	task_ratelimit = task_ratelimit_0			(1)
	(any non-zero initial value is OK)

After 200ms, we measured

	dirty_rate = # of pages dirtied by all dd's / 200ms
	write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

	dirty_rate == N * task_rate
	           == N * task_ratelimit
	           == N * task_ratelimit_0			(2)
Or
	task_ratelimit_0 = dirty_rate / N			(3)

Now we conclude that the balanced task ratelimit can be estimated by

	balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)	(4)

Because with (2) and (3), (4) yields the desired equality (1):

	balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
	              == write_bw / N

.............................................................................

Now let's revisit (1). Since balance_dirty_pages() chooses to execute
the ratelimit

	task_ratelimit = task_ratelimit_0
	               = dirty_ratelimit * pos_ratio		(5)

Put (5) into (4), we get the final form used in
bdi_update_dirty_ratelimit()

	balanced_rate = (dirty_ratelimit * pos_ratio) * (write_bw / dirty_rate)

So you really need to take (dirty_ratelimit * pos_ratio) as a single entity.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
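Formula (4) in the message above can be checked numerically. The sketch below (plain Python, illustrative only; names taken from the derivation, with assumption (2) hard-coded, i.e. N aggressive dirtiers all running at the throttled rate) shows the estimate settling at write_bw / N regardless of the initial task_ratelimit_0:

```python
def estimate_balanced_rate(n_tasks, write_bw, task_ratelimit_0, rounds=5):
    """Iterate formula (4): balanced_rate = ratelimit * write_bw / dirty_rate.

    Under equality (2), the measured dirty_rate of N aggressive dirtiers is
    N * task_ratelimit, so a single iteration already lands on write_bw / N;
    the remaining rounds just confirm the fixed point.
    """
    ratelimit = float(task_ratelimit_0)
    for _ in range(rounds):
        dirty_rate = n_tasks * ratelimit                 # equality (2)
        ratelimit = ratelimit * write_bw / dirty_rate    # formula (4)
    return ratelimit
```

For example, with 4 dd tasks against a 100 MB/s disk, any non-zero starting value converges to 25 MB/s, which is exactly the desired equality balanced_rate == write_bw / N.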
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-19  2:54 ` Wu Fengguang
@ 2011-08-19 19:00 ` Vivek Goyal
  -1 siblings, 0 replies; 301+ messages in thread
From: Vivek Goyal @ 2011-08-19 19:00 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> Hi Vivek,
> 
> > > +	base_rate = bdi->dirty_ratelimit;
> > > +	pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > +				       background_thresh, nr_dirty,
> > > +				       bdi_thresh, bdi_dirty);
> > > +	if (unlikely(pos_ratio == 0)) {
> > > +		pause = MAX_PAUSE;
> > > +		goto pause;
> > > 	}
> > > +	task_ratelimit = (u64)base_rate *
> > > +			 pos_ratio >> RATELIMIT_CALC_SHIFT;
> > 
> > Hi Fengguang,
> > 
> > I am a little confused here. I see that you have already taken pos_ratio
> > into account in bdi_update_dirty_ratelimit() and am wondering why we take
> > that into account again in balance_dirty_pages().
> > 
> > We calculated the pos_rate and balance_rate and adjusted
> > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> 
> Good question. There are some inter-dependencies in the calculation,
> and the dependency chain is the opposite of the one in your mind:
> balance_dirty_pages() used pos_ratio in the first place, so
> bdi_update_dirty_ratelimit() has to use pos_ratio in the calculation
> of the balanced dirty rate, too.
> 
> Let's return to how the balanced dirty rate is estimated. Please pay
> special attention to the last paragraphs below the "......" line.
> 
> Start by throttling each dd task at rate
> 
> 	task_ratelimit = task_ratelimit_0			(1)
> 	(any non-zero initial value is OK)
> 
> After 200ms, we measured
> 
> 	dirty_rate = # of pages dirtied by all dd's / 200ms
> 	write_bw   = # of pages written to the disk / 200ms
> 
> For the aggressive dd dirtiers, the equality holds
> 
> 	dirty_rate == N * task_rate
> 	           == N * task_ratelimit
> 	           == N * task_ratelimit_0			(2)
> Or
> 	task_ratelimit_0 = dirty_rate / N			(3)
> 
> Now we conclude that the balanced task ratelimit can be estimated by
> 
> 	balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)	(4)
> 
> Because with (2) and (3), (4) yields the desired equality (1):
> 
> 	balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> 	              == write_bw / N

Hi Fengguang,

Following is my understanding. Please correct me where I got it wrong.

Ok, I think I follow up to this point. I think what you are saying is
that the following is our goal in a stable system:

	task_ratelimit = write_bw / N				(6)

So we measure the write_bw of a bdi over a period of time and use that
as a feedback loop to modify bdi->dirty_ratelimit, which in turn
modifies task_ratelimit, and hence we achieve the balance. So we will
start with some arbitrary task limit, say task_ratelimit_0, and modify
that limit over a period of time based on our feedback loop to achieve
a balanced system. And the following seems to be the formula:

	                                    write_bw
	task_ratelimit = task_ratelimit_0 * ----------		(7)
	                                    dirty_rate

Now I also understand that by using (2) and (3), you proved how (7)
will lead to (6), and that is our desired goal.

> .............................................................................
> 
> Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> the ratelimit
> 
> 	task_ratelimit = task_ratelimit_0
> 	               = dirty_ratelimit * pos_ratio		(5)
> 

So balance_dirty_pages() chose to take pos_ratio() into account also
because, for various reasons, taking only bandwidth variation into
account as feedback was not sufficient.
So we also took pos_ratio into account, which in turn is dependent on
global dirty pages and per-bdi dirty_pages/rate. So we refined the
formula for calculating a task's effective rate over a period of time
to the following:

	                                    write_bw
	task_ratelimit = task_ratelimit_0 * ---------- * pos_ratio	(9)
	                                    dirty_rate

Is my understanding right so far?

> Put (5) into (4), we get the final form used in
> bdi_update_dirty_ratelimit()
> 
> 	balanced_rate = (dirty_ratelimit * pos_ratio) * (write_bw / dirty_rate)
> 
> So you really need to take (dirty_ratelimit * pos_ratio) as a single entity.

Now a few questions.

- What is dirty_ratelimit in the formula above?

- Is it wrong to understand the issue in the following manner?

  bdi->dirty_ratelimit is tracking write bandwidth variation on the bdi
  and effectively tracks write_bw/N.

	bdi->dirty_ratelimit = write_bw/N
  or
	                                                       write_bw
	bdi->dirty_ratelimit = previous_bdi->dirty_ratelimit * ----------	(10)
	                                                       dirty_rate

  Hence a task's balanced rate from (9) and (10) is:

	task_ratelimit = bdi->dirty_ratelimit * pos_ratio	(11)

So is my understanding about (10) and (11) wrong? If no, then the
question comes that bdi->dirty_ratelimit is supposed to be keeping
track of write bandwidth variations only. And in turn task ratelimit
will be driven by both bandwidth variation as well as pos_ratio
variation.

But you seem to be doing the following:

	bdi->dirty_ratelimit = adjust based on a combination of bandwidth
			       feedback and pos_ratio feedback

	task_ratelimit = bdi->dirty_ratelimit * pos_ratio	(12)

So my question is: when task_ratelimit is finally being adjusted based
on pos_ratio feedback, why does bdi->dirty_ratelimit also need to take
that into account?

I know you have tried explaining it, but sorry, I did not get it. Maybe
give it another shot in layman's terms and I might understand it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-19 19:00 ` Vivek Goyal
@ 2011-08-21  3:46 ` Wu Fengguang
  -1 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-21  3:46 UTC (permalink / raw)
To: Vivek Goyal
Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > Hi Vivek,
> > 
> > > > +	base_rate = bdi->dirty_ratelimit;
> > > > +	pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > +				       background_thresh, nr_dirty,
> > > > +				       bdi_thresh, bdi_dirty);
> > > > +	if (unlikely(pos_ratio == 0)) {
> > > > +		pause = MAX_PAUSE;
> > > > +		goto pause;
> > > > 	}
> > > > +	task_ratelimit = (u64)base_rate *
> > > > +			 pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > 
> > > Hi Fengguang,
> > > 
> > > I am a little confused here. I see that you have already taken pos_ratio
> > > into account in bdi_update_dirty_ratelimit() and am wondering why we take
> > > that into account again in balance_dirty_pages().
> > > 
> > > We calculated the pos_rate and balance_rate and adjusted
> > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > 
> > Good question. There are some inter-dependencies in the calculation,
> > and the dependency chain is the opposite of the one in your mind:
> > balance_dirty_pages() used pos_ratio in the first place, so
> > bdi_update_dirty_ratelimit() has to use pos_ratio in the calculation
> > of the balanced dirty rate, too.
> > 
> > Let's return to how the balanced dirty rate is estimated. Please pay
> > special attention to the last paragraphs below the "......" line.
> > 
> > Start by throttling each dd task at rate
> > 
> > 	task_ratelimit = task_ratelimit_0			(1)
> > 	(any non-zero initial value is OK)
> > 
> > After 200ms, we measured
> > 
> > 	dirty_rate = # of pages dirtied by all dd's / 200ms
> > 	write_bw   = # of pages written to the disk / 200ms
> > 
> > For the aggressive dd dirtiers, the equality holds
> > 
> > 	dirty_rate == N * task_rate
> > 	           == N * task_ratelimit
> > 	           == N * task_ratelimit_0			(2)
> > Or
> > 	task_ratelimit_0 = dirty_rate / N			(3)
> > 
> > Now we conclude that the balanced task ratelimit can be estimated by
> > 
> > 	balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)	(4)
> > 
> > Because with (2) and (3), (4) yields the desired equality (1):
> > 
> > 	balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > 	              == write_bw / N
> 
> Hi Fengguang,
> 
> Following is my understanding. Please correct me where I got it wrong.
> 
> Ok, I think I follow up to this point. I think what you are saying is
> that the following is our goal in a stable system:
> 
> 	task_ratelimit = write_bw / N				(6)
> 
> So we measure the write_bw of a bdi over a period of time and use that
> as a feedback loop to modify bdi->dirty_ratelimit, which in turn
> modifies task_ratelimit, and hence we achieve the balance. So we will
> start with some arbitrary task limit, say task_ratelimit_0, and modify
> that limit over a period of time based on our feedback loop to achieve
> a balanced system. And the following seems to be the formula:
> 
> 	                                    write_bw
> 	task_ratelimit = task_ratelimit_0 * ----------		(7)
> 	                                    dirty_rate
> 
> Now I also understand that by using (2) and (3), you proved how (7)
> will lead to (6), and that is our desired goal.

That's right.

> > .............................................................................
> > 
> > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > the ratelimit
> > 
> > 	task_ratelimit = task_ratelimit_0
> > 	               = dirty_ratelimit * pos_ratio		(5)
> 
> So balance_dirty_pages() chose to take pos_ratio() into account also
> because, for various reasons, taking only bandwidth variation into
> account as feedback was not sufficient. So we also took pos_ratio
> into account, which in turn is dependent on global dirty pages and
> per-bdi dirty_pages/rate.

That's right so far. balance_dirty_pages() needs to do dirty position
control, so it used formula (5).

> So we refined the formula for calculating a task's effective rate
> over a period of time to the following:
> 
> 	                                    write_bw
> 	task_ratelimit = task_ratelimit_0 * ---------- * pos_ratio	(9)
> 	                                    dirty_rate

That's not true. It should still be formula (7) when
balance_dirty_pages() considers pos_ratio.

> > Put (5) into (4), we get the final form used in
> > bdi_update_dirty_ratelimit()
> > 
> > 	balanced_rate = (dirty_ratelimit * pos_ratio) * (write_bw / dirty_rate)
> > 
> > So you really need to take (dirty_ratelimit * pos_ratio) as a single entity.
> 
> Now a few questions.
> 
> - What is dirty_ratelimit in the formula above?

It's bdi->dirty_ratelimit.

> - Is it wrong to understand the issue in the following manner?
> 
>   bdi->dirty_ratelimit is tracking write bandwidth variation on the bdi
>   and effectively tracks write_bw/N.
> 
> 	bdi->dirty_ratelimit = write_bw/N

Yes. Strictly speaking, the target value is (note the "==")

	bdi->dirty_ratelimit == write_bw/N

> or
> 
> 	                                                       write_bw
> 	bdi->dirty_ratelimit = previous_bdi->dirty_ratelimit * ----------	(10)
> 	                                                       dirty_rate

Both (9) and (10) are not true. The right form is

	                                                                     write_bw
	balanced_rate = whatever_ratelimit_executed_in_balance_dirty_pages * ----------
	                                                                     dirty_rate
where

	whatever_ratelimit_executed_in_balance_dirty_pages ~= bdi->dirty_ratelimit * pos_ratio
	bdi->dirty_ratelimit ~= balanced_rate

> Hence a task's balanced rate from (9) and (10) is:
> 
> 	task_ratelimit = bdi->dirty_ratelimit * pos_ratio	(11)
> 
> So is my understanding about (10) and (11) wrong? If no, then the
> question comes that

(11) in itself is right. It's the exact form used in the code.

> bdi->dirty_ratelimit is supposed to be keeping
> track of write bandwidth variations only.

Yes, in a stable workload. Besides, if the number of dd tasks (N)
changed, dirty_ratelimit will adapt to the new value (write_bw / N).

> And in turn task ratelimit will be driven by both bandwidth variation
> as well as pos_ratio variation.

That's right.

> But you seem to be doing the following:
> 
> 	bdi->dirty_ratelimit = adjust based on a combination of bandwidth
> 			       feedback and pos_ratio feedback
> 
> 	task_ratelimit = bdi->dirty_ratelimit * pos_ratio	(12)
> 
> So my question is: when task_ratelimit is finally being adjusted based
> on pos_ratio feedback, why does bdi->dirty_ratelimit also need to take
> that into account?

In _concept_, bdi->dirty_ratelimit only depends on
whatever_ratelimit_executed_in_balance_dirty_pages. Then, we try to
estimate the latter with the formula

	whatever_ratelimit_executed_in_balance_dirty_pages ~= bdi->dirty_ratelimit * pos_ratio

That is the main reason we want to limit the step size of
bdi->dirty_ratelimit: otherwise the above estimation will have big
errors if bdi->dirty_ratelimit has changed a lot during the past 200ms.

That's also the reason balanced_rate will have larger errors when close
to @limit: because there pos_ratio drops _quickly_ to 0, hence the
regular fluctuations in dirty pages will result in big fluctuations in
the _relative_ value of pos_ratio.

> I know you have tried explaining it, but sorry, I did not get it.
> Maybe give it another shot in layman's terms and I might understand it.

Sorry for that. I can explain if you have more questions :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
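The point that (dirty_ratelimit * pos_ratio) must be treated as a single entity can also be checked numerically. In the sketch below (plain Python, illustrative only; pos_ratio is fed in as an arbitrary external signal rather than computed from dirty-page counts as bdi_position_ratio() does, and the task count is assumed fixed), the executed ratelimit is dirty_ratelimit * pos_ratio per (5), so the measured dirty_rate already contains pos_ratio, and the two occurrences cancel in balanced_rate. As a result dirty_ratelimit itself still converges to write_bw / N even while pos_ratio fluctuates. A limited step toward balanced_rate mirrors the step-size limiting discussed above.

```python
def update_dirty_ratelimit(dirty_ratelimit, pos_ratio, write_bw, n_tasks,
                           step=0.25):
    """One 200ms update of bdi->dirty_ratelimit (sketch, not kernel code).

    Tasks actually ran at (dirty_ratelimit * pos_ratio), formula (5), so the
    measured dirty_rate contains pos_ratio as well -- and it cancels out:
    balanced_rate degenerates to write_bw / n_tasks.
    """
    executed = dirty_ratelimit * pos_ratio   # what balance_dirty_pages() ran at
    dirty_rate = n_tasks * executed          # equality (2) with (5) plugged in
    balanced_rate = executed * write_bw / dirty_rate
    # move only a limited step toward balanced_rate, so the estimate of the
    # "ratelimit executed during the last 200ms" stays accurate
    return dirty_ratelimit + step * (balanced_rate - dirty_ratelimit)

r = 1.0
for i in range(60):
    pos_ratio = 0.5 + 0.4 * ((-1) ** i)      # fluctuating position control
    r = update_dirty_ratelimit(r, pos_ratio, write_bw=100.0, n_tasks=4)
# r ends up near 25.0 == write_bw / N, independent of the pos_ratio wobble
```

The design point this illustrates: since pos_ratio appears on both sides of the feedback loop, the loop stays stable, while leaving pos_ratio out of the balanced_rate estimate (as in hypothetical formula (10)) would make dirty_ratelimit chase the position-control signal instead of write_bw / N.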
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-22 17:22 ` Vivek Goyal
  0 siblings, 0 replies; 301+ messages in thread
From: Vivek Goyal @ 2011-08-22 17:22 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
    Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
    Andrea Righi, linux-mm, LKML

On Sun, Aug 21, 2011 at 11:46:58AM +0800, Wu Fengguang wrote:
> On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> > On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > > Hi Vivek,
> > >
> > > > > +	base_rate = bdi->dirty_ratelimit;
> > > > > +	pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > > +				       background_thresh, nr_dirty,
> > > > > +				       bdi_thresh, bdi_dirty);
> > > > > +	if (unlikely(pos_ratio == 0)) {
> > > > > +		pause = MAX_PAUSE;
> > > > > +		goto pause;
> > > > > 	}
> > > > > +	task_ratelimit = (u64)base_rate *
> > > > > +			 pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > >
> > > > Hi Fengguang,
> > > >
> > > > I am a little confused here. I see that you have already taken
> > > > pos_ratio into account in bdi_update_dirty_ratelimit() and am
> > > > wondering why to take that into account again in
> > > > balance_dirty_pages().
> > > >
> > > > We calculated the pos_rate and balanced_rate and adjusted the
> > > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > >
> > > Good question. There are some inter-dependencies in the calculation,
> > > and the dependency chain is the opposite to the one in your mind:
> > > balance_dirty_pages() used pos_ratio in the first place, so that
> > > bdi_update_dirty_ratelimit() has to use pos_ratio in the calculation
> > > of the balanced dirty rate, too.
> > >
> > > Let's return to how the balanced dirty rate is estimated. Please pay
> > > special attention to the last paragraphs below the "......" line.
> > >
> > > Start by throttling each dd task at rate
> > >
> > >     task_ratelimit = task_ratelimit_0                               (1)
> > >                      (any non-zero initial value is OK)
> > >
> > > After 200ms, we measured
> > >
> > >     dirty_rate = # of pages dirtied by all dd's / 200ms
> > >     write_bw   = # of pages written to the disk / 200ms
> > >
> > > For the aggressive dd dirtiers, the equality holds
> > >
> > >     dirty_rate == N * task_rate
> > >                == N * task_ratelimit
> > >                == N * task_ratelimit_0                              (2)
> > > Or
> > >     task_ratelimit_0 = dirty_rate / N                               (3)
> > >
> > > Now we conclude that the balanced task ratelimit can be estimated by
> > >
> > >     balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > >
> > > Because with (2) and (3), (4) yields the desired equality (1):
> > >
> > >     balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > >                   == write_bw / N
> >
> > Hi Fengguang,
> >
> > Following is my understanding. Please correct me where I got it wrong.
> >
> > Ok, I think I follow till this point. I think what you are saying is
> > that the following is our goal in a stable system.
> >
> >     task_ratelimit = write_bw / N                                     (6)
> >
> > So we measure the write_bw of a bdi over a period of time and use that
> > as a feedback loop to modify bdi->dirty_ratelimit, which in turn
> > modifies task_ratelimit, and hence we achieve the balance. So we will
> > start with some arbitrary task limit, say task_ratelimit_0, and modify
> > that limit over a period of time based on our feedback loop to achieve
> > a balanced system. And the following seems to be the formula.
> >
> >     task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)       (7)
> >
> > Now I also understand that by using (2) and (3), you proved
> > how (7) will lead to (6) and that is our desired goal.
>
> That's right.
>
> > > .............................................................................
> > >
> > > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > > the ratelimit
> > >
> > >     task_ratelimit = task_ratelimit_0
> > >                    = dirty_ratelimit * pos_ratio                    (5)
> >
> > So balance_dirty_pages() chose to take pos_ratio into account as well,
> > because for various reasons just taking bandwidth variation as
> > feedback was not sufficient. So we also took pos_ratio into account,
> > which in turn is dependent on global dirty pages and per-bdi
> > dirty_pages/rate.
>
> That's right so far. balance_dirty_pages() needs to do dirty position
> control, so used formula (5).
>
> > So we refined the formula for calculating a task's effective rate
> > over a period of time to the following.
> >
> >     task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate) * pos_ratio   (9)
>
> That's not true. It should still be formula (7) when
> balance_dirty_pages() considers pos_ratio.

Why is it not true? If I do some math, it sounds right. Let me summarize
my understanding again.

- In a steady state stable system, we want dirty_bw = write_bw, IOW

      dirty_bw / write_bw = 1                                   (1)

  If we can achieve the above then that means we are throttling tasks
  at just the right rate. Or

      dirty_bw == write_bw
      N * task_ratelimit == write_bw
      task_ratelimit = write_bw / N                             (2)

  So as long as we can come up with a system where
  balance_dirty_pages() calculates task_ratelimit to be write_bw/N, we
  should be fine.

- But this does not take care of imbalances. So if the system goes out
  of balance before the feedback loop kicks in and the dirty rate
  shoots up, then the cache size will grow and the number of dirty
  pages will shoot up. Hence we brought in the notion of position
  ratio, where we also vary a task's dirty ratelimit based on the
  number of dirty pages. So our effective formula became

      task_ratelimit = write_bw/N * pos_ratio                   (3)

  So as long as we meet (3), we should reach a stable state.

- But here N is unknown in advance, so balance_dirty_pages() can not
  make use of this formula directly. But write_bw and dirty_bw from
  the previous 200ms are known. So the following can replace (3).

      task_ratelimit = task_ratelimit_0 * (write_bw / dirty_bw) * pos_ratio   (4)

      dirty_bw = task_ratelimit_0 * N                           (5)

  Substitute (5) in (4)

      task_ratelimit = write_bw/N * pos_ratio                   (6)

  (6) is the same as (3), which has been derived from (4), and that
  means at any given point of time (4) can be used by
  balance_dirty_pages() to calculate a task's throttling rate.

- Now going back to (4). Because we have a feedback loop where we
  continuously update a previous number based on feedback, we can
  track the previous value in bdi->dirty_ratelimit.

      task_ratelimit = task_ratelimit_0 * (write_bw / dirty_bw) * pos_ratio

  Or

      task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)

  where

      bdi->dirty_ratelimit = task_ratelimit_0 * (write_bw / dirty_bw)

  Because task_ratelimit_0 is the initial value to begin with and we
  will keep on coming up with a new value every 200ms, we should be
  able to write the above as follows.

      bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * (write_bw / dirty_bw)   (8)

  Effectively we start with an initial value of task_ratelimit_0 and
  then keep on updating it based on rate change feedback every 200ms.

To summarize, we need to achieve (3) for a balanced system. Because we
don't know the value of N in advance, we can use (4) to achieve the
effect of (3). So we start with a default value of task_ratelimit_0
and update that every 200ms based on how the write and dirty rates on
the device are changing (8). We also further refine that rate by
pos_ratio so that any variations in the number of dirty pages due to
temporary imbalances in the system can be accounted for (7).

I see that you also use (7). I think the only contention point is how
(8) is perceived. So can you please explain why you think that the
above calculation or (9) is wrong.

I can kind of understand that you have done various adjustments to
keep the task_ratelimit and bdi->dirty_ratelimit relatively stable.
It's just that I am not able to understand your calculations in
updating bdi->dirty_ratelimit.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-23  1:07 ` Wu Fengguang
  0 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-23 1:07 UTC (permalink / raw)
To: Vivek Goyal
Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
    Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
    Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 01:22:30AM +0800, Vivek Goyal wrote:
> On Sun, Aug 21, 2011 at 11:46:58AM +0800, Wu Fengguang wrote:
> > On Sat, Aug 20, 2011 at 03:00:37AM +0800, Vivek Goyal wrote:
> > > On Fri, Aug 19, 2011 at 10:54:06AM +0800, Wu Fengguang wrote:
> > > > Hi Vivek,
> > > >
> > > > > > +	base_rate = bdi->dirty_ratelimit;
> > > > > > +	pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > > > > > +				       background_thresh, nr_dirty,
> > > > > > +				       bdi_thresh, bdi_dirty);
> > > > > > +	if (unlikely(pos_ratio == 0)) {
> > > > > > +		pause = MAX_PAUSE;
> > > > > > +		goto pause;
> > > > > > 	}
> > > > > > +	task_ratelimit = (u64)base_rate *
> > > > > > +			 pos_ratio >> RATELIMIT_CALC_SHIFT;
> > > > >
> > > > > Hi Fengguang,
> > > > >
> > > > > I am a little confused here. I see that you have already taken
> > > > > pos_ratio into account in bdi_update_dirty_ratelimit() and am
> > > > > wondering why to take that into account again in
> > > > > balance_dirty_pages().
> > > > >
> > > > > We calculated the pos_rate and balanced_rate and adjusted the
> > > > > bdi->dirty_ratelimit accordingly in bdi_update_dirty_ratelimit().
> > > >
> > > > Good question. There are some inter-dependencies in the calculation,
> > > > and the dependency chain is the opposite to the one in your mind:
> > > > balance_dirty_pages() used pos_ratio in the first place, so that
> > > > bdi_update_dirty_ratelimit() has to use pos_ratio in the calculation
> > > > of the balanced dirty rate, too.
> > > >
> > > > Let's return to how the balanced dirty rate is estimated. Please pay
> > > > special attention to the last paragraphs below the "......" line.
> > > >
> > > > Start by throttling each dd task at rate
> > > >
> > > >     task_ratelimit = task_ratelimit_0                               (1)
> > > >                      (any non-zero initial value is OK)
> > > >
> > > > After 200ms, we measured
> > > >
> > > >     dirty_rate = # of pages dirtied by all dd's / 200ms
> > > >     write_bw   = # of pages written to the disk / 200ms
> > > >
> > > > For the aggressive dd dirtiers, the equality holds
> > > >
> > > >     dirty_rate == N * task_rate
> > > >                == N * task_ratelimit
> > > >                == N * task_ratelimit_0                              (2)
> > > > Or
> > > >     task_ratelimit_0 = dirty_rate / N                               (3)
> > > >
> > > > Now we conclude that the balanced task ratelimit can be estimated by
> > > >
> > > >     balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (4)
> > > >
> > > > Because with (2) and (3), (4) yields the desired equality (1):
> > > >
> > > >     balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > > >                   == write_bw / N
> > >
> > > Hi Fengguang,
> > >
> > > Following is my understanding. Please correct me where I got it wrong.
> > >
> > > Ok, I think I follow till this point. I think what you are saying is
> > > that the following is our goal in a stable system.
> > >
> > >     task_ratelimit = write_bw / N                                     (6)
> > >
> > > So we measure the write_bw of a bdi over a period of time and use that
> > > as a feedback loop to modify bdi->dirty_ratelimit, which in turn
> > > modifies task_ratelimit, and hence we achieve the balance. So we will
> > > start with some arbitrary task limit, say task_ratelimit_0, and modify
> > > that limit over a period of time based on our feedback loop to achieve
> > > a balanced system. And the following seems to be the formula.
> > >
> > >     task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)       (7)
> > >
> > > Now I also understand that by using (2) and (3), you proved
> > > how (7) will lead to (6) and that is our desired goal.
> >
> > That's right.
> >
> > > > .............................................................................
> > > >
> > > > Now let's revisit (1). Since balance_dirty_pages() chooses to execute
> > > > the ratelimit
> > > >
> > > >     task_ratelimit = task_ratelimit_0
> > > >                    = dirty_ratelimit * pos_ratio                    (5)
> > >
> > > So balance_dirty_pages() chose to take pos_ratio into account as well,
> > > because for various reasons just taking bandwidth variation as
> > > feedback was not sufficient. So we also took pos_ratio into account,
> > > which in turn is dependent on global dirty pages and per-bdi
> > > dirty_pages/rate.
> >
> > That's right so far. balance_dirty_pages() needs to do dirty position
> > control, so used formula (5).
> >
> > > So we refined the formula for calculating a task's effective rate
> > > over a period of time to the following.
> > >
> > >     task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate) * pos_ratio   (9)
> >
> > That's not true. It should still be formula (7) when
> > balance_dirty_pages() considers pos_ratio.
>
> Why is it not true? If I do some math, it sounds right. Let me
> summarize my understanding again.

Ah sorry! (9) actually holds true, as made clear by your reasoning
below.

> - In a steady state stable system, we want dirty_bw = write_bw, IOW
>
>       dirty_bw / write_bw = 1                                   (1)
>
>   If we can achieve the above then that means we are throttling tasks
>   at just the right rate. Or
>
>       dirty_bw == write_bw
>       N * task_ratelimit == write_bw
>       task_ratelimit = write_bw / N                             (2)
>
>   So as long as we can come up with a system where
>   balance_dirty_pages() calculates task_ratelimit to be write_bw/N,
>   we should be fine.

Right.

> - But this does not take care of imbalances. So if the system goes
>   out of balance before the feedback loop kicks in and the dirty rate
>   shoots up, then the cache size will grow and the number of dirty
>   pages will shoot up. Hence we brought in the notion of position
>   ratio, where we also vary a task's dirty ratelimit based on the
>   number of dirty pages. So our effective formula became
>
>       task_ratelimit = write_bw/N * pos_ratio                   (3)
>
>   So as long as we meet (3), we should reach a stable state.

Right.

> - But here N is unknown in advance, so balance_dirty_pages() can not
>   make use of this formula directly. But write_bw and dirty_bw from
>   the previous 200ms are known. So the following can replace (3).
>
>       task_ratelimit = task_ratelimit_0 * (write_bw / dirty_bw) * pos_ratio   (4)
>
>       dirty_bw = task_ratelimit_0 * N                           (5)
>
>   Substitute (5) in (4)
>
>       task_ratelimit = write_bw/N * pos_ratio                   (6)
>
>   (6) is the same as (3), which has been derived from (4), and that
>   means at any given point of time (4) can be used by
>   balance_dirty_pages() to calculate a task's throttling rate.

Right. Sorry, what was in my mind was

    balanced_rate = task_ratelimit_0 * (write_bw / dirty_bw)

    task_ratelimit = balanced_rate * pos_ratio

which is effectively the same as your combined equation (4).

> - Now going back to (4). Because we have a feedback loop where we
>   continuously update a previous number based on feedback, we can
>   track the previous value in bdi->dirty_ratelimit.
>
>       task_ratelimit = task_ratelimit_0 * (write_bw / dirty_bw) * pos_ratio
>
>   Or
>
>       task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)
>
>   where
>
>       bdi->dirty_ratelimit = task_ratelimit_0 * (write_bw / dirty_bw)

Right.

>   Because task_ratelimit_0 is the initial value to begin with and we
>   will keep on coming up with a new value every 200ms, we should be
>   able to write the above as follows.
>
>       bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * (write_bw / dirty_bw)   (8)
>
>   Effectively we start with an initial value of task_ratelimit_0 and
>   then keep on updating it based on rate change feedback every 200ms.

Right.

> To summarize, we need to achieve (3) for a balanced system. Because
> we don't know the value of N in advance, we can use (4) to achieve
> the effect of (3). So we start with a default value of
> task_ratelimit_0 and update that every 200ms based on how the write
> and dirty rates on the device are changing (8). We also further
> refine that rate by pos_ratio so that any variations in the number of
> dirty pages due to temporary imbalances in the system can be
> accounted for (7).
>
> I see that you also use (7). I think the only contention point is how
> (8) is perceived. So can you please explain why you think that the
> above calculation or (9) is wrong.

There is no contention point and (9) is right. Sorry, it's my fault.
We are well aligned in the above reasoning :)

> I can kind of understand that you have done various adjustments to
> keep the task_ratelimit and bdi->dirty_ratelimit relatively stable.
> It's just that I am not able to understand your calculations in
> updating bdi->dirty_ratelimit.

You mean the below chunk of code? It is effectively the same as this
_one_ line of code

    bdi->dirty_ratelimit = balanced_rate;

except for doing some tricks (conditional update and limiting the step
size) to stabilize bdi->dirty_ratelimit:

	unsigned long base_rate = bdi->dirty_ratelimit;

	/*
	 * Use a different name for the same value to distinguish the
	 * concepts. Only the relative value of
	 *	(pos_rate - base_rate) = (pos_ratio - 1) * base_rate
	 * will be used below, which reflects the direction and size of
	 * dirty position error.
	 */
	pos_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT;

	/*
	 * dirty_ratelimit will follow balanced_rate iff pos_rate is on
	 * the same side of dirty_ratelimit, too.  For example,
	 * - (base_rate > balanced_rate) => dirty rate is too high
	 * - (base_rate > pos_rate)	 => dirty pages are above setpoint
	 * so lowering base_rate will help meet both the position and
	 * rate control targets. Otherwise, don't update base_rate if it
	 * will only help meet the rate target. After all, what the
	 * users ultimately feel and care about are a stable dirty rate
	 * and a small position error. This update policy can also
	 * prevent dirty_ratelimit from being driven away by possible
	 * systematic errors in balanced_rate.
	 *
	 * |base_rate - pos_rate| is also used to limit the step size
	 * for filtering out the singular points of balanced_rate, which
	 * keeps jumping around randomly and can even leap far away at
	 * times due to the small 200ms estimation period of dirty_rate
	 * (we want to keep that period small to reduce time lags).
	 */
	delta = 0;
	if (base_rate < balanced_rate) {
		if (base_rate < pos_rate)
			delta = min(balanced_rate, pos_rate) - base_rate;
	} else {
		if (base_rate > pos_rate)
			delta = base_rate - max(balanced_rate, pos_rate);
	}

	/*
	 * Don't pursue 100% rate matching. It's impossible since the
	 * balanced rate itself is constantly fluctuating. So decrease
	 * the track speed when it gets close to the target. Helps
	 * eliminate pointless tremors.
	 */
	delta >>= base_rate / (8 * delta + 1);

	/*
	 * Limit the tracking speed to avoid overshooting.
	 */
	delta = (delta + 7) / 8;

	if (base_rate < balanced_rate)
		base_rate += delta;
	else
		base_rate -= delta;

	bdi->dirty_ratelimit = max(base_rate, 1UL);

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 301+ messages in thread
So we > start with a default value of task_ratelimit_0 and update that every > 200ms based on how write and dirty rate on device is changing (8). We also > further refine that rate by pos_ratio so that any variations in number > of dirty pages due to temporary imbalances in the system can be > accounted for (7). > > I see that you also use (7). I think only contention point is how > (8) is perceived. So can you please explain why do you think that > above calculation or (9) is wrong. There is no contention point and (9) is right..Sorry it's my fault. We are well aligned in the above reasoning :) > I can kind of understand that you have done various adjustments to keep the > task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that > I am not able to understand your calculations in updating bdi->dirty_ratelimit. You mean the below chunk of code? Which is effectively the same as this _one_ line of code bdi->dirty_ratelimit = balanced_rate; except for doing some tricks (conditional update and limiting step size) to stabilize bdi->dirty_ratelimit: unsigned long base_rate = bdi->dirty_ratelimit; /* * Use a different name for the same value to distinguish the concepts. * Only the relative value of * (pos_rate - base_rate) = (pos_ratio - 1) * base_rate * will be used below, which reflects the direction and size of dirty * position error. */ pos_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT; /* * dirty_ratelimit will follow balanced_rate iff pos_rate is on the * same side of dirty_ratelimit, too. * For example, * - (base_rate > balanced_rate) => dirty rate is too high * - (base_rate > pos_rate) => dirty pages are above setpoint * so lowering base_rate will help meet both the position and rate * control targets. Otherwise, don't update base_rate if it will only * help meet the rate target. After all, what the users ultimately feel * and care are stable dirty rate and small position error. 
This * update policy can also prevent dirty_ratelimit from being driven * away by possible systematic errors in balanced_rate. * * |base_rate - pos_rate| is also used to limit the step size for * filtering out the singular points of balanced_rate, which keeps * jumping around randomly and can even leap far away at times due to * the small 200ms estimation period of dirty_rate (we want to keep * that period small to reduce time lags). */ delta = 0; if (base_rate < balanced_rate) { if (base_rate < pos_rate) delta = min(balanced_rate, pos_rate) - base_rate; } else { if (base_rate > pos_rate) delta = base_rate - max(balanced_rate, pos_rate); } /* * Don't pursue 100% rate matching. It's impossible since the balanced * rate itself is constantly fluctuating. So decrease the track speed * when it gets close to the target. Helps eliminate pointless tremors. */ delta >>= base_rate / (8 * delta + 1); /* * Limit the tracking speed to avoid overshooting. */ delta = (delta + 7) / 8; if (base_rate < balanced_rate) base_rate += delta; else base_rate -= delta; bdi->dirty_ratelimit = max(base_rate, 1UL); Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
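For readers outside the kernel tree, the quoted update policy can be restated as a stand-alone C function. This is only a sketch of the logic above: plain `unsigned long` arithmetic replaces the kernel's u64 helpers, the bdi structure is dropped, and a shift cap is added (plain userspace C would hit undefined behavior when `base_rate / (8 * delta + 1)` exceeds the word width, which the kernel context avoids in practice). The test values are hypothetical.

```c
#include <assert.h>

#define RATELIMIT_CALC_SHIFT 10	/* pos_ratio fixed point: 1.0 == 1 << 10 */

/*
 * Sketch of the quoted update policy: move base_rate toward
 * balanced_rate only when the position error (pos_rate vs base_rate)
 * points in the same direction as the rate error, and limit the step.
 */
unsigned long update_dirty_ratelimit(unsigned long base_rate,
				     unsigned long balanced_rate,
				     unsigned long pos_ratio)
{
	unsigned long pos_rate, delta = 0;

	pos_rate = (unsigned long)(((unsigned long long)base_rate *
				    pos_ratio) >> RATELIMIT_CALC_SHIFT);

	if (base_rate < balanced_rate) {
		if (base_rate < pos_rate)
			delta = (balanced_rate < pos_rate ?
				 balanced_rate : pos_rate) - base_rate;
	} else {
		if (base_rate > pos_rate)
			delta = base_rate - (balanced_rate > pos_rate ?
					     balanced_rate : pos_rate);
	}

	if (delta) {
		/* decrease the track speed when close to the target ... */
		unsigned long shift = base_rate / (8 * delta + 1);
		delta = shift < sizeof(delta) * 8 ? delta >> shift : 0;
		/* ... and never move more than ~1/8 of the gap at once */
		delta = (delta + 7) / 8;
	}

	if (base_rate < balanced_rate)
		base_rate += delta;
	else
		base_rate -= delta;

	return base_rate > 1 ? base_rate : 1;
}
```

With these numbers, a matched rate-and-position error moves base_rate by 1/8 of the gap, while a pure rate error with the position already on target is ignored — the "conditional update" trick the comment describes.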
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-23 1:07 ` Wu Fengguang @ 2011-08-23 3:53 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-23 3:53 UTC (permalink / raw) To: Vivek Goyal Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML > > Because task_ratelimit_0 is initial value to begin with and we will > > keep on coming with new value every 200ms, we should be able to write > > above as follows. > > > > write_bw > > bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * -------- (8) > > dirty_bw > > > > Effectively we start with an initial value of task_ratelimit_0 and > > then keep on updating it based on rate change feedback every 200ms. Ah sorry, based on the reply to Peter, there is no inherent dependency between balanced_rate_n and balanced_rate_(n-1). bdi->dirty_ratelimit does track balanced_rate in small steps, and hence will have some relationship with its previous value other than equation (8). So, although you may conduct equation (8) for balanced_rate, we'd better not understand things in that way. Keep this fundamental formula in mind and don't try to complicate it: balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
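The one-step property of that fundamental formula is easy to check numerically. The sketch below uses hypothetical numbers and the idealized assumptions from the thread: N identical aggressive dirtiers and a perfectly measured 200ms window, so dirty_rate == N * task_ratelimit (equation (2) above).

```c
#include <assert.h>

/*
 * balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
 *
 * For N aggressive dirtiers, dirty_rate == N * task_ratelimit, so the
 * estimate collapses to write_bw / N in a single period, independent
 * of the task_ratelimit the period started with.
 */
unsigned long estimate_balanced_rate(unsigned long task_ratelimit,
				     unsigned long write_bw,
				     unsigned long n_tasks)
{
	unsigned long dirty_rate = n_tasks * task_ratelimit;

	/* widen to 64 bits before multiplying, as the kernel code does */
	return (unsigned long)(((unsigned long long)task_ratelimit *
				write_bw) / dirty_rate);
}
```

Whether the window started at 123 or 99999 pages/s per task, the estimate lands on write_bw / N immediately — which is why there is no inherent dependency between consecutive balanced_rate samples.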
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-23 1:07 ` Wu Fengguang @ 2011-08-23 13:53 ` Vivek Goyal -1 siblings, 0 replies; 301+ messages in thread From: Vivek Goyal @ 2011-08-23 13:53 UTC (permalink / raw) To: Wu Fengguang Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Tue, Aug 23, 2011 at 09:07:21AM +0800, Wu Fengguang wrote: [..] > > > > So we refined the formula for calculating a tasks's effective rate > > > > over a period of time to following. > > > > write_bw > > > > task_ratelimit = task_ratelimit_0 * ------- * pos_ratio (9) > > > > dirty_rate > > > > > > > > > > That's not true. It should still be formula (7) when > > > balance_drity_pages() considers pos_ratio. > > > > Why it is not true? If I do some math, it sounds right. Let me summarize > > my understanding again. > > Ah sorry! (9) actually holds true, as made clear by your below reasoning. > > > - In a steady state stable system, we want dirty_bw = write_bw, IOW. > > > > dirty_bw/write_bw = 1 (1) > > > > If we can achieve above then that means we are throttling tasks at > > just right rate. > > > > Or > > - dirty_bw == write_bw > > N * task_ratelimit == write_bw > > task_ratelimit = write_bw/N (2) > > > > So as long as we can come up with a system where balance_dirty_pages() > > calculates task_ratelimit to be write_bw/N, we should be fine. > > Right. > > > - But this does not take care of imbalances. So if system goes out of > > balance before feedback loop kicks in and dirty rate shoots up, then > > cache size will grow and number of dirty pages will shoot up. Hence > > we brought in the notion of position ratio where we also vary a > > tasks's dirty ratelimit based on number of dirty pages. So our > > effective formula became. > > > > task_ratelimit = write_bw/N * pos_ratio (3) > > > > So as long as we meet (3), we should reach to stable state. > > Right. 
> > > - But here N is unknown in advance so balance_drity_pages() can not make > > use of this formula directly. But write_bw and dirty_bw from previous > > 200ms are known. So following can replace (3). > > > > write_bw > > task_ratelimit = task_ratelimit_0 * --------- * pos_ratio (4) > > dirty_bw > > > > dirty_bw = task_ratelimit_0 * N (5) > > > > Substitute (5) in (4) > > > > task_ratelimit = write_bw/N * pos_ratio (6) > > > > (6) is same as (3) which has been derived from (4) and that means at any > > given point of time (4) can be used by balance_drity_pages() to calculate > > a tasks's throttling rate. > > Right. Sorry what's in my mind was > > write_bw > balanced_rate = task_ratelimit_0 * -------- > dirty_bw > > task_ratelimit = balanced_rate * pos_ratio > > which is effective the same to your combined equation (4). > > > - Now going back to (4). Because we have a feedback loop where we > > continuously update a previous number based on feedback, we can track > > previous value in bdi->dirty_ratelimit. > > > > write_bw > > task_ratelimit = task_ratelimit_0 * --------- * pos_ratio > > dirty_bw > > > > Or > > > > task_ratelimit = bdi->dirty_ratelimit * pos_ratio (7) > > > > where > > write_bw > > bdi->dirty_ratelimit = task_ratelimit_0 * --------- > > dirty_bw > > Right. > > > Because task_ratelimit_0 is initial value to begin with and we will > > keep on coming with new value every 200ms, we should be able to write > > above as follows. > > > > write_bw > > bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * -------- (8) > > dirty_bw > > > > Effectively we start with an initial value of task_ratelimit_0 and > > then keep on updating it based on rate change feedback every 200ms. > > Right. > > > To summarize, > > > > We need to achieve (3) for a balanced system. Because we don't know the > > value of N in advance, we can use (4) to achieve effect of (3). 
So we > > start with a default value of task_ratelimit_0 and update that every > > 200ms based on how write and dirty rate on device is changing (8). We also > > further refine that rate by pos_ratio so that any variations in number > > of dirty pages due to temporary imbalances in the system can be > > accounted for (7). > > > > I see that you also use (7). I think only contention point is how > > (8) is perceived. So can you please explain why do you think that > > above calculation or (9) is wrong. > > There is no contention point and (9) is right..Sorry it's my fault. > We are well aligned in the above reasoning :) Great. We are on the same page now, at least till this point. > > > I can kind of understand that you have done various adjustments to keep the > > task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that > > I am not able to understand your calculations in updating bdi->dirty_ratelimit. > > You mean the below chunk of code? Which is effectively the same as this _one_ > line of code > > bdi->dirty_ratelimit = balanced_rate; > > except for doing some tricks (conditional update and limiting step size) to > stabilize bdi->dirty_ratelimit: I am fine with bdi->dirty_ratelimit being called balanced rate. I am taking exception to the fact that you are also taking into account pos_ratio while coming up with new balanced_rate after 200ms of feedback. We agreed to update bdi->dirty_ratelimit as follows (8 above). write_bw bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * -------- (8) dirty_bw I think in your terminology it could be written as write_bw new_balanced_rate = prev_balanced_rate * ---------- (9) dirty_bw But what you seem to be doing is the following. 
I am fine with limiting the step size etc. So (9) and (10) don't match? Now going back to your code to show how I arrived at (10). executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT; (11) balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth, dirty_rate | 1); (12) Combining (11) and (12) gives us (10). write_bw balance_rate = base_rate * pos_ratio -------- dirty_rate Or write_bw bdi->dirty_ratelimit = base_rate * pos_ratio -------- dirty_rate To complicate things, you also have the notion of pos_rate and reduce the step size based on either pos_rate or balance_rate. pos_rate = executed_rate = base_rate * pos_ratio; write_bw balance_rate = base_rate * pos_ratio -------- dirty_rate bdi->dirty_rate_limit = min_change(pos_rate, balance_rate) (13) So for the feedback, why not stick simply to (9), limit the step size, and not take pos_ratio into account? Even if you have to take it into account, it needs to be explained clearly, and so many rate definitions confuse things more. Keeping names consistent everywhere (even for local variables) helps understand the code better. Look at the number of rates we have in the code; it gets confusing. balanced_rate base_rate bdi->dirty_ratelimit executed_rate pos_rate task_ratelimit dirty_rate write_bw Here balanced_rate, base_rate and bdi->dirty_ratelimit all seem to be referring to the same thing, and that is not obvious from the code. It looks like task_ratelimit, executed_rate and pos_rate are referring to the same thing. So instead of 6 rates, we could at least collapse the naming to 2 rates to keep the context clear. Just prefix/suffix more strings to highlight the subtle differences between the two rates. Thanks Vivek ^ permalink raw reply [flat|nested] 301+ messages in thread
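One idealized way to see why the extra pos_ratio in (10) need not bias the estimate: over the same 200ms window, the measured dirty rate was itself produced by tasks throttled at base_rate * pos_ratio, so pos_ratio appears in both the numerator and the denominator of (12) and cancels. A toy calculation under the strong assumptions that all N tasks saw one constant pos_ratio for the whole window and measurement is exact (hypothetical numbers; `double` is used for brevity instead of the kernel's fixed-point shift):

```c
#include <assert.h>

/*
 * Idealized version of (11) + (12): all N tasks ran at
 * executed = base_rate * pos_ratio for the whole window, so the
 * measured dirty_rate is N * executed and pos_ratio cancels out:
 * the result is write_bw / N whatever pos_ratio was.
 */
unsigned long balanced_rate_with_pos(unsigned long base_rate,
				     double pos_ratio,
				     unsigned long write_bw,
				     unsigned long n_tasks)
{
	double executed = (double)base_rate * pos_ratio;	/* (11) */
	double dirty_rate = (double)n_tasks * executed;		/* measured */

	/* (12); round to nearest to absorb floating-point noise */
	return (unsigned long)(executed * write_bw / dirty_rate + 0.5);
}
```

Whether the window ran below the setpoint (pos_ratio 1.3) or above it (pos_ratio 0.7), the same write_bw / N comes out — the feedback's accuracy is unaffected, and the debate reduces to how the code should be named and commented.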
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages() 2011-08-23 13:53 ` Vivek Goyal @ 2011-08-24 3:09 ` Wu Fengguang -1 siblings, 0 replies; 301+ messages in thread From: Wu Fengguang @ 2011-08-24 3:09 UTC (permalink / raw) To: Vivek Goyal Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML On Tue, Aug 23, 2011 at 09:53:55PM +0800, Vivek Goyal wrote: > On Tue, Aug 23, 2011 at 09:07:21AM +0800, Wu Fengguang wrote: > > [..] > > > > > So we refined the formula for calculating a tasks's effective rate > > > > > over a period of time to following. > > > > > write_bw > > > > > task_ratelimit = task_ratelimit_0 * ------- * pos_ratio (9) > > > > > dirty_rate > > > > > > > > > > > > > That's not true. It should still be formula (7) when > > > > balance_drity_pages() considers pos_ratio. > > > > > > Why it is not true? If I do some math, it sounds right. Let me summarize > > > my understanding again. > > > > Ah sorry! (9) actually holds true, as made clear by your below reasoning. > > > > > - In a steady state stable system, we want dirty_bw = write_bw, IOW. > > > > > > dirty_bw/write_bw = 1 (1) > > > > > > If we can achieve above then that means we are throttling tasks at > > > just right rate. > > > > > > Or > > > - dirty_bw == write_bw > > > N * task_ratelimit == write_bw > > > task_ratelimit = write_bw/N (2) > > > > > > So as long as we can come up with a system where balance_dirty_pages() > > > calculates task_ratelimit to be write_bw/N, we should be fine. > > > > Right. > > > > > - But this does not take care of imbalances. So if system goes out of > > > balance before feedback loop kicks in and dirty rate shoots up, then > > > cache size will grow and number of dirty pages will shoot up. Hence > > > we brought in the notion of position ratio where we also vary a > > > tasks's dirty ratelimit based on number of dirty pages. So our > > > effective formula became. 
> > > > > > task_ratelimit = write_bw/N * pos_ratio (3) > > > > > > So as long as we meet (3), we should reach to stable state. > > > > Right. > > > > > - But here N is unknown in advance so balance_drity_pages() can not make > > > use of this formula directly. But write_bw and dirty_bw from previous > > > 200ms are known. So following can replace (3). > > > > > > write_bw > > > task_ratelimit = task_ratelimit_0 * --------- * pos_ratio (4) > > > dirty_bw > > > > > > dirty_bw = task_ratelimit_0 * N (5) > > > > > > Substitute (5) in (4) > > > > > > task_ratelimit = write_bw/N * pos_ratio (6) > > > > > > (6) is same as (3) which has been derived from (4) and that means at any > > > given point of time (4) can be used by balance_drity_pages() to calculate > > > a tasks's throttling rate. > > > > Right. Sorry what's in my mind was > > > > write_bw > > balanced_rate = task_ratelimit_0 * -------- > > dirty_bw > > > > task_ratelimit = balanced_rate * pos_ratio > > > > which is effective the same to your combined equation (4). > > > > > - Now going back to (4). Because we have a feedback loop where we > > > continuously update a previous number based on feedback, we can track > > > previous value in bdi->dirty_ratelimit. > > > > > > write_bw > > > task_ratelimit = task_ratelimit_0 * --------- * pos_ratio > > > dirty_bw > > > > > > Or > > > > > > task_ratelimit = bdi->dirty_ratelimit * pos_ratio (7) > > > > > > where > > > write_bw > > > bdi->dirty_ratelimit = task_ratelimit_0 * --------- > > > dirty_bw > > > > Right. > > > > > Because task_ratelimit_0 is initial value to begin with and we will > > > keep on coming with new value every 200ms, we should be able to write > > > above as follows. > > > > > > write_bw > > > bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * -------- (8) > > > dirty_bw > > > > > > Effectively we start with an initial value of task_ratelimit_0 and > > > then keep on updating it based on rate change feedback every 200ms. > > > > Right. 
> > > > > To summarize, > > > > > > We need to achieve (3) for a balanced system. Because we don't know the > > > value of N in advance, we can use (4) to achieve effect of (3). So we > > > start with a default value of task_ratelimit_0 and update that every > > > 200ms based on how write and dirty rate on device is changing (8). We also > > > further refine that rate by pos_ratio so that any variations in number > > > of dirty pages due to temporary imbalances in the system can be > > > accounted for (7). > > > > > > I see that you also use (7). I think only contention point is how > > > (8) is perceived. So can you please explain why do you think that > > > above calculation or (9) is wrong. > > > > There is no contention point and (9) is right..Sorry it's my fault. > > We are well aligned in the above reasoning :) > > Great. Now we are on same page now at least till this point. > > > > > > I can kind of understand that you have done various adjustments to keep the > > > task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that > > > I am not able to understand your calculations in updating bdi->dirty_ratelimit. > > > > You mean the below chunk of code? Which is effectively the same as this _one_ > > line of code > > > > bdi->dirty_ratelimit = balanced_rate; > > > > except for doing some tricks (conditional update and limiting step size) to > > stabilize bdi->dirty_ratelimit: > > I am fine with bdi->dirty_ratelimit being called balanced rate. I am > taking exception to the fact that you are also taking into accout > pos_ratio while coming up with new balanced_rate after 200ms of feedback. > > We agreed to updating bdi->dirty_ratelimit as follows (8 above). > > > write_bw > bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * -------- (8) > dirty_bw > > I think in your terminology it could be called. > write_bw > new_balanced_rate = prev_balanced_rate * ---------- (9) > dirty_bw > > But what you seem to be doing is following. 
> write_bw > new_balanced_rate = prev_balanced_rate * pos_ratio * ----------- (10) > dirty_bw > > Of course I have just tried to simlify your actual calculations to > show why I am questioning the presence of pos_ratio while calculating > the new bdi->dirty_ratelimit. I am fine with limiting the step size etc. > > So (9) and (10) don't match? > > Now going back to your code and show how I arrived at (10). > > executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT; (11) > balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth, > dirty_rate | 1); (12) > > Combining (11) and (12) gives us (10). > write_bw > balance_rate = base_rate * pos_ratio -------- > dirty_rate > > Or > write_bw > bdi->dirty_ratelimit = base_rate * pos_ratio -------- > dirty_rate I hope the other email on the balanced_rate estimation equation can clarify the questions on pos_ratio. > To complicate the things you also have the notion of pos_rate and reduce > the step size based on either pos_rate or balance_rate. > > pos_rate = executed_rate = base_rate * pos_ratio; > > write_bw > balance_rate = base_rate * pos_ratio -------- > dirty_rate > > bdi->dirty_rate_limit = min_change(pos_rate, balance_rate) (13) > > So for feedback, why are not sticking to simply (9) and limit the step > size and not take pos_ratio into account. pos_rate is used to limit the step size. This reply to Peter has more details: http://www.spinics.net/lists/linux-fsdevel/msg47991.html > Even if you have to take it into account, it needs to be explained clearly > and so many rate definitions confuse things more. Keeping name constant > everywhere (even for local variables), helps understand the code better. > Good idea! There are too many names that differ subtly. > Look at number of rates we have in code and it gets so confusing. 
> > balanced_rate > base_rate > bdi->dirty_ratelimit > > executed_rate > pos_rate > task_ratelimit > > dirty_rate > write_bw > > Here balanced_rate, base_rate and bdi->dirty_ratelimit all seem to be > referring to same thing and that is not obivious from the code. Looks > like task->ratelimit and executed_rate and pos_rate are referring to same > thing. Right. > So instead of 6 rates, we could atleast collpase the naming to 2 rates > to keep the context clear. Just prefix/suffix more strings to highlight > subtle difference between two rates. How about balanced_rate => balanced_dirty_ratelimit base_rate => dirty_ratelimit bdi->dirty_ratelimit == bdi->dirty_ratelimit pos_rate => task_ratelimit executed_rate => task_ratelimit task_ratelimit == task_ratelimit Thanks, Fengguang ^ permalink raw reply [flat|nested] 301+ messages in thread
* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-24 3:09 ` Wu Fengguang
0 siblings, 0 replies; 301+ messages in thread
From: Wu Fengguang @ 2011-08-24 3:09 UTC (permalink / raw)
To: Vivek Goyal
Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 09:53:55PM +0800, Vivek Goyal wrote:
> On Tue, Aug 23, 2011 at 09:07:21AM +0800, Wu Fengguang wrote:
> > [..]
> > > > > So we refined the formula for calculating a task's effective rate
> > > > > over a period of time to the following:
> > > > >
> > > > >                                         write_bw
> > > > >     task_ratelimit = task_ratelimit_0 * ---------- * pos_ratio   (9)
> > > > >                                         dirty_rate
> > > >
> > > > That's not true. It should still be formula (7) when
> > > > balance_dirty_pages() considers pos_ratio.
> > >
> > > Why is it not true? If I do some math, it sounds right. Let me summarize
> > > my understanding again.
> >
> > Ah sorry! (9) actually holds true, as made clear by your below reasoning.
> >
> > > - In a steady state stable system, we want dirty_bw = write_bw, IOW
> > >
> > >       dirty_bw/write_bw = 1                         (1)
> > >
> > >   If we can achieve the above, then that means we are throttling tasks
> > >   at just the right rate.
> > >
> > >   Or
> > >   - dirty_bw == write_bw
> > >     N * task_ratelimit == write_bw
> > >     task_ratelimit = write_bw/N                     (2)
> > >
> > >   So as long as we can come up with a system where balance_dirty_pages()
> > >   calculates task_ratelimit to be write_bw/N, we should be fine.
> >
> > Right.
> >
> > > - But this does not take care of imbalances. So if the system goes out
> > >   of balance before the feedback loop kicks in and the dirty rate shoots
> > >   up, then the cache size will grow and the number of dirty pages will
> > >   shoot up. Hence we brought in the notion of position ratio, where we
> > >   also vary a task's dirty ratelimit based on the number of dirty pages.
> > >   So our effective formula became:
> > >
> > >       task_ratelimit = write_bw/N * pos_ratio       (3)
> > >
> > >   So as long as we meet (3), we should reach a stable state.
> >
> > Right.
> >
> > > - But here N is unknown in advance, so balance_dirty_pages() cannot make
> > >   use of this formula directly. But write_bw and dirty_bw from the
> > >   previous 200ms are known. So the following can replace (3):
> > >
> > >                                           write_bw
> > >       task_ratelimit = task_ratelimit_0 * -------- * pos_ratio   (4)
> > >                                           dirty_bw
> > >
> > >       dirty_bw = task_ratelimit_0 * N               (5)
> > >
> > >   Substitute (5) in (4):
> > >
> > >       task_ratelimit = write_bw/N * pos_ratio       (6)
> > >
> > >   (6) is the same as (3), which has been derived from (4), and that
> > >   means at any given point of time (4) can be used by
> > >   balance_dirty_pages() to calculate a task's throttling rate.
> >
> > Right. Sorry, what I had in mind was
> >
> >                                        write_bw
> >     balanced_rate = task_ratelimit_0 * --------
> >                                        dirty_bw
> >
> >     task_ratelimit = balanced_rate * pos_ratio
> >
> > which is effectively the same as your combined equation (4).
> >
> > > - Now going back to (4). Because we have a feedback loop where we
> > >   continuously update a previous number based on feedback, we can track
> > >   the previous value in bdi->dirty_ratelimit.
> > >
> > >                                           write_bw
> > >       task_ratelimit = task_ratelimit_0 * -------- * pos_ratio
> > >                                           dirty_bw
> > >
> > >   Or
> > >
> > >       task_ratelimit = bdi->dirty_ratelimit * pos_ratio   (7)
> > >
> > >   where
> > >                                                 write_bw
> > >       bdi->dirty_ratelimit = task_ratelimit_0 * --------
> > >                                                 dirty_bw
> >
> > Right.
> >
> > >   Because task_ratelimit_0 is the initial value to begin with, and we
> > >   will keep on coming up with a new value every 200ms, we should be able
> > >   to write the above as follows:
> > >
> > >                                                           write_bw
> > >       bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------   (8)
> > >                                                           dirty_bw
> > >
> > >   Effectively we start with an initial value of task_ratelimit_0 and
> > >   then keep on updating it based on rate-change feedback every 200ms.
> >
> > Right.
> > > To summarize,
> > >
> > > We need to achieve (3) for a balanced system. Because we don't know the
> > > value of N in advance, we can use (4) to achieve the effect of (3). So
> > > we start with a default value of task_ratelimit_0 and update it every
> > > 200ms based on how the write and dirty rates on the device are
> > > changing (8). We also further refine that rate by pos_ratio so that any
> > > variations in the number of dirty pages due to temporary imbalances in
> > > the system can be accounted for (7).
> > >
> > > I see that you also use (7). I think the only contention point is how
> > > (8) is perceived. So can you please explain why you think that the
> > > above calculation, or (9), is wrong.
> >
> > There is no contention point and (9) is right.. Sorry, it's my fault.
> > We are well aligned in the above reasoning :)
>
> Great. We are on the same page now, at least up to this point.
>
> > > I can kind of understand that you have done various adjustments to keep
> > > task_ratelimit and bdi->dirty_ratelimit relatively stable. It's just
> > > that I am not able to understand your calculations in updating
> > > bdi->dirty_ratelimit.
> >
> > You mean the below chunk of code? It is effectively the same as this
> > _one_ line of code
> >
> >     bdi->dirty_ratelimit = balanced_rate;
> >
> > except for doing some tricks (conditional update and limiting step size)
> > to stabilize bdi->dirty_ratelimit:
>
> I am fine with bdi->dirty_ratelimit being called the balanced rate. I am
> taking exception to the fact that you are also taking into account
> pos_ratio while coming up with the new balanced_rate after 200ms of
> feedback.
>
> We agreed to updating bdi->dirty_ratelimit as follows (8 above):
>
>                                                           write_bw
>     bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------   (8)
>                                                           dirty_bw
>
> I think in your terminology it could be called:
>
>                                              write_bw
>     new_balanced_rate = prev_balanced_rate * ----------   (9)
>                                              dirty_bw
>
> But what you seem to be doing is the following:
>                                                          write_bw
>     new_balanced_rate = prev_balanced_rate * pos_ratio * -----------   (10)
>                                                          dirty_bw
>
> Of course I have just tried to simplify your actual calculations to
> show why I am questioning the presence of pos_ratio while calculating
> the new bdi->dirty_ratelimit. I am fine with limiting the step size etc.
>
> So (9) and (10) don't match?
>
> Now going back to your code to show how I arrived at (10):
>
>     executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT;  (11)
>     balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
>                             dirty_rate | 1);                             (12)
>
> Combining (11) and (12) gives us (10):
>
>                                             write_bw
>     balanced_rate = base_rate * pos_ratio * ----------
>                                             dirty_rate
>
> Or
>                                                    write_bw
>     bdi->dirty_ratelimit = base_rate * pos_ratio * ----------
>                                                    dirty_rate

I hope the other email on the balanced_rate estimation equation can
clarify the questions on pos_ratio..

> To complicate things, you also have the notion of pos_rate and reduce
> the step size based on either pos_rate or balanced_rate.
>
>     pos_rate = executed_rate = base_rate * pos_ratio;
>
>                                             write_bw
>     balanced_rate = base_rate * pos_ratio * ----------
>                                             dirty_rate
>
>     bdi->dirty_ratelimit = min_change(pos_rate, balanced_rate)   (13)
>
> So for feedback, why not stick to simply (9), limit the step size, and
> not take pos_ratio into account?

pos_rate is used to limit the step size. This reply to Peter has more
details: http://www.spinics.net/lists/linux-fsdevel/msg47991.html

> Even if you have to take it into account, it needs to be explained
> clearly, and so many rate definitions confuse things more. Keeping names
> constant everywhere (even for local variables) helps understand the code
> better.

Good idea! There are too many names that differ subtly..

> Look at the number of rates we have in the code; it gets so confusing.
>
>     balanced_rate
>     base_rate
>     bdi->dirty_ratelimit
>
>     executed_rate
>     pos_rate
>     task_ratelimit
>
>     dirty_rate
>     write_bw
>
> Here balanced_rate, base_rate and bdi->dirty_ratelimit all seem to be
> referring to the same thing, and that is not obvious from the code. It
> looks like task_ratelimit, executed_rate and pos_rate are referring to
> the same thing.

Right.

> So instead of 6 rates, we could at least collapse the naming to 2 rates
> to keep the context clear. Just prefix/suffix more strings to highlight
> the subtle difference between the two rates.

How about

    balanced_rate        => balanced_dirty_ratelimit
    base_rate            => dirty_ratelimit
    bdi->dirty_ratelimit == bdi->dirty_ratelimit

    pos_rate             => task_ratelimit
    executed_rate        => task_ratelimit
    task_ratelimit       == task_ratelimit

Thanks,
Fengguang