linux-kernel.vger.kernel.org archive mirror
* [PATCH 01/13] writeback: IO-less balance_dirty_pages()
@ 2010-11-17  3:58 Wu Fengguang
  2010-11-17  4:19 ` Wu Fengguang
  2010-11-17  4:30 ` Wu Fengguang
  0 siblings, 2 replies; 18+ messages in thread
From: Wu Fengguang @ 2010-11-17  3:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Ts'o, Chris Mason, Dave Chinner, Jan Kara,
	Peter Zijlstra, Jens Axboe, Wu Fengguang, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, Christoph Hellwig, linux-mm,
	linux-fsdevel, LKML

Andrew,
References: <20101117035821.000579293@intel.com>
Content-Disposition: inline; filename=writeback-bw-throttle.patch

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the meantime, kick off the
per-bdi flusher thread to do background writeback IO.

This patch introduces the basic framework, which will be further
consolidated by the next patches.

RATIONALE
=========

The current balance_dirty_pages() is rather IO inefficient.

- concurrent writeback of multiple inodes (Dave Chinner)

  If every throttled writer thread starts foreground writeback, there
  are N IO submitters working on at least N different inodes at the
  same time. They end up issuing N different sets of IO with
  potentially zero locality to each other, resulting in much lower
  elevator sort/merge efficiency, and hence the disk seeks all over
  the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by the current balance_dirty_pages() cannot
  simply be set to some large value (eg. 128MB) for better IO efficiency,
  because that could lead to user perceivable stalls of more than 1
  second. Even the current 4MB write size may be too large for slow USB
  sticks. The fact that balance_dirty_pages() starts IO by itself couples
  the IO size to the wait time, which makes it hard to choose a suitable
  IO size while keeping the wait time under control.

For the above two reasons, it's much better to shift IO to the flusher
threads and let balance_dirty_pages() just wait for enough time or progress.

Jan Kara, Dave Chinner and I explored a scheme that lets
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to long throttle wait times and jitter.

- NFS may clear a large number of unstable pages with one single COMMIT.
  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So such bursty
  IO completions are unlikely to be optimized away, and neither are the
  resulting large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than  10ms, which burns CPU power)
- avoid too large pause time (more than 100ms, which hurts responsiveness)
- avoid big fluctuations of pause times

For example, when doing a simple cp on ext4 with mem=4G HZ=250:

Before the patch, the pause time fluctuates from 0 to 324ms (pause= is
printed in jiffies, 4ms each at HZ=250), and the stall time may grow
very large for slow devices:

[ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
[ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
[ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0

After the patch, the pause time remains stable around 32ms:

cp-2687  [002]  1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [002]  1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [006]  1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [006]  1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [006]  1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [002]  1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [002]  1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8

CONTROL SYSTEM
==============

The current task_dirty_limit() derives task_dirty_limit from
bdi_dirty_limit according to the dirty "weight" of the current task,
which is the percentage of recently dirtied pages that were dirtied by
the task. If 100% of the pages were recently dirtied by the task, it
lowers bdi_dirty_limit by 1/8. If only 1% of the pages were dirtied by
the task, it returns bdi_dirty_limit almost unmodified. In this way, a
heavy dirtier will get blocked at
task_dirty_limit=(bdi_dirty_limit-bdi_dirty_limit/8) while a light
dirtier is allowed to progress (the latter won't be blocked because
R << B in fig.1).

Fig.1 before patch, a heavy dirtier and a light dirtier
                                                R
----------------------------------------------+-o---------------------------*--|
                                              L A                           B  T
  T: bdi_dirty_limit, as returned by bdi_dirty_limit()
  L: T - T/8

  R: bdi_reclaimable + bdi_writeback

  A: task_dirty_limit for a heavy dirtier ~= R ~= L
  B: task_dirty_limit for a light dirtier ~= T

Since each process has its own dirty limit, we reuse A/B for the tasks as
well as their dirty limits.

If B is a newly started heavy dirtier, then it will slowly gain weight
and A will lose weight.  The task_dirty_limit for A and B will be
approaching the center of region (L, T) and eventually stabilize there.

Fig.2 before patch, two heavy dirtiers converging to the same threshold
                                                             R
----------------------------------------------+--------------o-*---------------|
                                              L              A B               T

Fig.3 after patch, one heavy dirtier
                                                |
    throttle_bandwidth ~= bdi_bandwidth  =>     o
                                                | o
                                                |   o
                                                |     o
                                                |       o
                                                |         o
                                              La|           o
----------------------------------------------+-+-------------o----------------|
                                                R             A                T
  T: bdi_dirty_limit
  A: task_dirty_limit      = T - Wa * T/16
  La: task_throttle_thresh = A - A/16

  R: bdi_dirty_pages = bdi_reclaimable + bdi_writeback ~= La

Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
way. In fig.3, a soft dirty limit region (La, A) is introduced. When R enters
this region, the task may be throttled for J jiffies on every N pages it dirtied.
Let's call (N/J) the "throttle bandwidth". It is computed by the following formula:

        throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
where
        A  = T - Wa * T/16
        La = A - A/16
where Wa is the task weight of A. It's 0 for a very light dirtier and 1
for the one heavy dirtier (that consumes 100% of the bdi write
bandwidth).  The task weight will be updated independently by
task_dirty_inc() at set_page_dirty() time.

When R < La, we don't throttle it at all.
When R > A, the code will detect the negative result and choose to pause
for 100ms (the upper pause boundary), then loop again.
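
The formula above can be sketched in plain C as follows. This is only an
illustrative user-space sketch, not code from this patch: the function
names, the representation of the task weight Wa as the fraction
wa_num/wa_den, and the sentinel return values are assumptions made for
the example.

```c
#include <stdint.h>

/* A = T - Wa * T/16, with Wa given as the fraction wa_num/wa_den (0..1). */
static uint64_t task_dirty_limit(uint64_t T, uint64_t wa_num, uint64_t wa_den)
{
	return T - (T / 16) * wa_num / wa_den;
}

/* La = A - A/16: the start of the soft throttling region. */
static uint64_t task_throttle_thresh(uint64_t A)
{
	return A - A / 16;
}

/*
 * throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
 *
 * Hypothetical conventions for this sketch: UINT64_MAX means "R < La,
 * do not throttle", 0 means "R >= A, hard throttle" (the caller would
 * then pause for the 100ms upper bound).
 */
static uint64_t throttle_bandwidth(uint64_t bdi_bandwidth, uint64_t T,
				   uint64_t wa_num, uint64_t wa_den,
				   uint64_t R)
{
	uint64_t A = task_dirty_limit(T, wa_num, wa_den);
	uint64_t La = task_throttle_thresh(A);

	if (R < La)
		return UINT64_MAX;
	if (R >= A)
		return 0;
	return bdi_bandwidth * (A - R) / (A - La);
}
```

With T=16384 pages and Wa=1 this gives A=15360 and La=14400, so a task
is throttled at the full bdi_bandwidth when R=La and at half of it when
R sits in the middle of (La, A).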


PSEUDO CODE
===========

balance_dirty_pages():

	/* soft throttling */
	if (task_throttle_thresh exceeded)
		sleep (task_dirtied_pages / throttle_bandwidth)

	/* hard throttling */
	while (task_dirty_limit exceeded) {
		sleep 100ms
		if (bdi_dirty_pages dropped more than task_dirtied_pages)
			break
	}

	/* global hard limit */
	while (dirty_limit exceeded)
		sleep 100ms

Basically there are three levels of throttling now.

- normally the dirtier will be adaptively throttled with good timing

- when task_dirty_limit is exceeded, the task will be throttled until
  the bdi dirty/writeback pages drop by a reasonably large amount

- when dirty_thresh is exceeded, the task can be throttled for an
  arbitrarily long time
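
The soft throttling sleep in the pseudo code,
sleep(task_dirtied_pages / throttle_bandwidth), could be computed along
these lines. This is a minimal user-space sketch under stated
assumptions: SKETCH_HZ=250 matches the cp example above, 4KB pages are
assumed, and the function and macro names are made up for illustration.
The bounds of 1 jiffy and 100ms mirror the clamp used in the
mm/page-writeback.c hunk of this patch.

```c
#include <stdint.h>

#define SKETCH_HZ		250	/* assumed tick rate */
#define SKETCH_PAGE_SHIFT	12	/* assumed 4KB pages */

/*
 * Pause time in jiffies for a task that dirtied pages_dirtied pages
 * while being allowed throttle_bw bytes per second.
 */
static unsigned long pause_jiffies(uint64_t pages_dirtied, uint64_t throttle_bw)
{
	uint64_t bytes = pages_dirtied << SKETCH_PAGE_SHIFT;
	/* the +1 avoids a division by zero when throttle_bw is 0 */
	uint64_t pause = SKETCH_HZ * bytes / (throttle_bw + 1);

	if (pause < 1)
		pause = 1;		/* lower bound: 1 jiffy */
	if (pause > SKETCH_HZ / 10)
		pause = SKETCH_HZ / 10;	/* upper bound: 100ms */
	return (unsigned long)pause;
}
```

Small bursts at a high allowed bandwidth are clamped up to one jiffy,
and a bandwidth near zero is clamped down to the 100ms bound, which is
what keeps each individual sleep short in the soft throttling case.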


BEHAVIOR CHANGE
===============

Users will notice that applications get throttled once the dirty pages
cross the global (background + dirty)/2=15% threshold. For a single
"cp", it could be soft throttled at 2*bdi->write_bandwidth around 15%
dirty pages, and be balanced at speed bdi->write_bandwidth around 17.5%
dirty pages. Before the patch, the behavior is to just throttle it at
17.5% dirty pages.

Since the task will be soft throttled earlier than before, end users may
perceive it as a performance "slow down" if their application happens to
dirty more than ~15% of memory.


BENCHMARKS
==========

The test box has a 4-core 3.2GHz CPU, 4GB mem and a SATA disk.

For each filesystem, the following command is run 3 times.

time (dd if=/dev/zero of=/tmp/10G bs=1M count=10240; sync); rm /tmp/10G

	    2.6.36-rc2-mm1	2.6.36-rc2-mm1+balance_dirty_pages
average real time
ext2        236.377s            232.144s              -1.8%
ext3        226.245s            225.751s              -0.2%
ext4        178.742s            179.343s              +0.3%
xfs         183.562s            179.808s              -2.0%
btrfs       179.044s            179.461s              +0.2%
NFS         645.627s            628.937s              -2.6%

average system time
ext2         22.142s             19.656s             -11.2%
ext3         34.175s             32.462s              -5.0%
ext4         23.440s             21.162s              -9.7%
xfs          19.089s             16.069s             -15.8%
btrfs        12.212s             11.670s              -4.4%
NFS          16.807s             17.410s              +3.6%

total user time
sum           0.136s              0.084s             -38.2%

In a more recent run of the tests, the patched kernel is in fact slightly slower:

ext2         49.500 MB/s         49.200 MB/s          -0.6%
ext3         50.133 MB/s         50.000 MB/s          -0.3%
ext4         64.000 MB/s         63.200 MB/s          -1.2%
xfs          63.500 MB/s         63.167 MB/s          -0.5%
btrfs        63.133 MB/s         63.033 MB/s          -0.2%
NFS          16.833 MB/s         16.867 MB/s          +0.2%

In general there are no big IO performance changes for desktop users,
except for some noticeable reduction of CPU overheads. The patch mainly
benefits file servers with heavy concurrent writers on fast storage
arrays, as can be demonstrated by 10/100 concurrent dd's on xfs:

- 1 dirtier case:    the same
- 10 dirtiers case:  CPU system time is reduced to 50%
- 100 dirtiers case: CPU system time is reduced to 10%, IO size and throughput increase by 10%

			2.6.37-rc2				2.6.37-rc1-next-20101115+
        ----------------------------------------        ----------------------------------------
	%system		wkB/s		avgrq-sz	%system		wkB/s		avgrq-sz
100dd	30.916		37843.000	748.670		3.079		41654.853	822.322
100dd	30.501		37227.521	735.754		3.744		41531.725	820.360

10dd	39.442		47745.021	900.935		20.756		47951.702	901.006
10dd	39.204		47484.616	899.330		20.550		47970.093	900.247

1dd	13.046		57357.468	910.659		13.060		57632.715	909.212
1dd	12.896		56433.152	909.861		12.467		56294.440	909.644

The CPU overhead in 2.6.37-rc1-next-20101115+ is higher than in
2.6.36-rc2-mm1+balance_dirty_pages. This may be due to the pause time
stabilizing at lower values after some algorithm adjustments (eg.
reducing the minimal pause time from 10ms to 1 jiffy in the new version),
leading to many more balance_dirty_pages() calls. The different pause
times also explain the different system times for the 1/10/100dd cases
on the same 2.6.37-rc1-next-20101115+.

CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/filesystems/writeback-throttling-design.txt |  210 ++++++++++
 include/linux/writeback.h                                 |   10 
 mm/page-writeback.c                                       |   85 +---
 3 files changed, 249 insertions(+), 56 deletions(-)

--- linux-next.orig/include/linux/writeback.h	2010-11-15 19:49:41.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-11-15 19:49:42.000000000 +0800
@@ -12,6 +12,16 @@ struct backing_dev_info;
 extern spinlock_t inode_lock;
 
 /*
+ * The 1/8 region under the bdi dirty threshold is set aside for elastic
+ * throttling. In rare cases when the threshold is exceeded, more rigid
+ * throttling will be imposed, which will inevitably stall the dirtier task
+ * for seconds (or more) at _one_ time. The rare case could be a fork bomb
+ * where every new task dirties some more pages.
+ */
+#define BDI_SOFT_DIRTY_LIMIT	8
+#define TASK_SOFT_DIRTY_LIMIT	(BDI_SOFT_DIRTY_LIMIT * 2)
+
+/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {
--- linux-next.orig/mm/page-writeback.c	2010-11-15 19:49:41.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-15 19:50:16.000000000 +0800
@@ -42,20 +42,6 @@
  */
 static long ratelimit_pages = 32;
 
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
-
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
@@ -279,7 +265,7 @@ static unsigned long task_dirty_limit(st
 {
 	long numerator, denominator;
 	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty >> 3;
+	u64 inv = dirty / TASK_SOFT_DIRTY_LIMIT;
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;
@@ -473,26 +459,25 @@ unsigned long bdi_dirty_limit(struct bac
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	unsigned long bw;
+	unsigned long pause;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -529,6 +514,23 @@ static void balance_dirty_pages(struct a
 			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		if (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) {
+			pause = HZ/10;
+			goto pause;
+		}
+
+		bw = 100 << 20; /* use static 100MB/s for the moment */
+
+		bw = bw * (bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback));
+		bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
+
+		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
+		pause = clamp_val(pause, 1, HZ/10);
+
+pause:
+		__set_current_state(TASK_INTERRUPTIBLE);
+		io_schedule_timeout(pause);
+
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the
 		 * global "hard" limit. The former helps to prevent heavy IO
@@ -544,35 +546,6 @@ static void balance_dirty_pages(struct a
 
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
-
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
-		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
-		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
-		__set_current_state(TASK_INTERRUPTIBLE);
-		io_schedule_timeout(pause);
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
 	if (!dirty_exceeded && bdi->dirty_exceeded)
@@ -589,7 +562,7 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
+	if ((laptop_mode && dirty_exceeded) ||
 	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_background_writeback(bdi);
 }
@@ -638,7 +611,7 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
+		ratelimit = *p;
 		*p = 0;
 		preempt_enable();
 		balance_dirty_pages(mapping, ratelimit);
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-next/Documentation/filesystems/writeback-throttling-design.txt	2010-11-15 19:49:42.000000000 +0800
@@ -0,0 +1,210 @@
+writeback throttling design
+---------------------------
+
+introduction to dirty throttling
+--------------------------------
+
+write(2) is normally a buffered write that creates dirty page cache pages
+for holding the data and returns immediately. The dirty pages will eventually
+be written to disk, or be dropped by unlink()/truncate().
+
+The delayed writeback of dirty pages enables the kernel to optimize the IO:
+
+- turn IO into async ones, which avoids blocking the tasks
+- submit IO as a batch for better throughput
+- avoid IO at all for temp files
+
+However, there have to be some limits on the number of allowable dirty pages.
+Typically applications are able to dirty pages more quickly than storage
+devices can write them. When approaching the dirty limits, the dirtier tasks
+will be throttled (put to brief sleeps from time to time) by
+balance_dirty_pages() in order to balance the dirty speed and writeback speed.
+
+dirty limits
+------------
+
+The dirty limit defaults to 20% reclaimable memory, and can be tuned via one of
+the following sysctl interfaces:
+
+	/proc/sys/vm/dirty_ratio
+	/proc/sys/vm/dirty_bytes
+
+The ultimate goal of balance_dirty_pages() is to keep the global dirty pages
+under control.
+
+	dirty_limit = dirty_ratio * free_reclaimable_pages
+
+However a global threshold may create deadlock for stacked BDIs (loop, FUSE and
+local NFS mounts). When A writes to B, and A generates enough dirty pages to
+get throttled, B will never start writeback until the dirty pages go away.
+
+Another problem is inter device starvation. When there are concurrent writes to
+a slow device and a fast one, the latter may well be starved due to unnecessary
+throttling on its dirtier tasks, leading to big IO performance drop.
+
+The solution is to split the global dirty limit into per-bdi limits among all
+the backing devices and scale writeback cache per backing device, proportional
+to its writeout speed.
+
+	bdi_dirty_limit = bdi_weight * dirty_limit
+
+where bdi_weight (ranging from 0 to 1) reflects the recent writeout speed of
+the BDI.
+
+We further scale the bdi dirty limit inversely with the task's dirty rate.
+This makes heavy writers have a lower dirty limit than the occasional writer,
+to prevent a heavy dd from slowing down all other light writers in the system.
+
+	task_dirty_limit = bdi_dirty_limit - task_weight * bdi_dirty_limit/16
+
+pause time
+----------
+
+The main task of dirty throttling is to determine when and how long to pause
+the current dirtier task.  Basically we want to
+
+- avoid too small pause time (less than 1 jiffy, which burns CPU power)
+- avoid too large pause time (more than 100ms, which hurts responsiveness)
+- avoid big fluctuations of pause times
+
+To smoothly control the pause time, we do soft throttling in a small region
+under task_dirty_limit, starting from
+
+	task_throttle_thresh = task_dirty_limit - task_dirty_limit/16
+
+In fig.1, when bdi_dirty_pages falls into
+
+    [0, La]:    do nothing
+    [La, A]:    do soft throttling
+    [A, inf]:   do hard throttling
+
+Where hard throttling is to wait until bdi_dirty_pages falls more than
+task_dirtied_pages (the pages dirtied by the task since its last throttle
+time). It's "hard" because it may end up waiting for long time.
+
+Fig.1 dirty throttling regions
+                                              o
+                                                o
+                                                  o
+                                                    o
+                                                      o
+                                                        o
+                                                          o
+                                                            o
+----------------------------------------------+---------------o----------------|
+                                              La              A                T
+                no throttle                     soft throttle   hard throttle
+  T: bdi_dirty_limit
+  A: task_dirty_limit      = T - task_weight * T/16
+  La: task_throttle_thresh = A - A/16
+
+Soft dirty throttling is to pause the dirtier task for J:pause_time jiffies on
+every N:task_dirtied_pages pages it dirtied.  Let's call (N/J) the "throttle
+bandwidth". It is computed by the following formula:
+
+                                     task_dirty_limit - bdi_dirty_pages
+throttle_bandwidth = bdi_bandwidth * ----------------------------------
+                                           task_dirty_limit/16
+
+where bdi_bandwidth is the BDI's estimated write speed.
+
+Given the throttle_bandwidth for a task, we select a suitable N, so that when
+the task dirties that many pages, it enters balance_dirty_pages() to sleep for
+roughly J jiffies. N is adaptive to storage and task write speeds, so that the
+task always gets a suitable (not too long or small) pause time.
+
+dynamics
+--------
+
+When there is one heavy dirtier, bdi_dirty_pages will keep growing until
+exceeding the low threshold of the task's soft throttling region [La, A].
+At which point (La) the task will be controlled under speed
+throttle_bandwidth=bdi_bandwidth (fig.2) and remain stable there.
+
+Fig.2 one heavy dirtier
+
+    throttle_bandwidth ~= bdi_bandwidth  =>   o
+                                              | o
+                                              |   o
+                                              |     o
+                                              |       o
+                                              |         o
+                                              |           o
+                                            La|             o
+----------------------------------------------+---------------o----------------|
+                                              R               A                T
+  R: bdi_dirty_pages ~= La
+
+When there comes a new dd task B, task_weight_B will gradually grow from 0 to
+50% while task_weight_A will decrease from 100% to 50%.  When task_weight_B is
+still small, B is considered a light dirtier and is allowed to dirty pages much
+faster than the bdi write bandwidth. In fact initially it won't be throttled at
+all when R < Lb where Lb = B - B/16 and B ~= T.
+
+Fig.3 an old dd (A) + a newly started dd (B)
+
+                      throttle bandwidth  =>    *
+                                                | *
+                                                |   *
+                                                |     *
+                                                |       *
+                                                |         *
+                                                |           *
+                                                |             *
+                      throttle bandwidth  =>    o               *
+                                                | o               *
+                                                |   o               *
+                                                |     o               *
+                                                |       o               *
+                                                |         o               *
+                                                |           o               *
+------------------------------------------------+-------------o---------------*|
+                                                R             A               BT
+
+So R:bdi_dirty_pages will grow large. As task_weight_A and task_weight_B
+converge to 50%, the points A, B will go towards each other (fig.4) and
+eventually coincide with each other. R will stabilize around A-A/32 where
+A=B=T-0.5*T/16.  throttle_bandwidth will stabilize around bdi_bandwidth/2.
+
+Note that the application "think+dirty time" is ignored for simplicity in the
+above discussions. With non-zero user space think time, the balance point will
+drift slightly, which is not a big deal otherwise.
+
+Fig.4 the two dd's converging to the same bandwidth
+
+                                                         |
+                                 throttle bandwidth  =>  *
+                                                         | *
+                                 throttle bandwidth  =>  o   *
+                                                         | o   *
+                                                         |   o   *
+                                                         |     o   *
+                                                         |       o   *
+                                                         |         o   *
+---------------------------------------------------------+-----------o---*-----|
+                                                         R           A   B     T
+
+There won't be big oscillations between A and B, because as soon as A coincides
+with B, their throttle_bandwidth and hence dirty speed will be equal, A's
+weight will stop decreasing and B's weight will stop growing, so the two points
+won't keep moving and cross each other.
+
+Sure there are always oscillations of bdi_dirty_pages as long as the dirtier
+task alternately dirties and pauses. But they will be bounded. When there is 1
+heavy dirtier, the error bound will be (pause_time * bdi_bandwidth). When there
+are 2 heavy dirtiers, the max error is 2 * (pause_time * bdi_bandwidth/2),
+which remains the same as 1 dirtier case (given the same pause time). In fact
+the more dirtier tasks, the less errors will be, since the dirtier tasks are
+not likely going to sleep at the same time.
+
+References
+----------
+
+Smarter write throttling
+http://lwn.net/Articles/245600/
+
+Flushing out pdflush
+http://lwn.net/Articles/326552/
+
+Dirty throttling slides
+http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling.pdf



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17  3:58 [PATCH 01/13] writeback: IO-less balance_dirty_pages() Wu Fengguang
@ 2010-11-17  4:19 ` Wu Fengguang
  2010-11-17  8:33   ` Wu Fengguang
  2010-11-17  4:30 ` Wu Fengguang
  1 sibling, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Ts'o, Chris Mason, Dave Chinner, Jan Kara,
	Peter Zijlstra, Jens Axboe, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Christoph Hellwig, linux-mm, linux-fsdevel,
	LKML

> BEHAVIOR CHANGE
> ===============
> 
> Users will notice that the applications will get throttled once the
> crossing the global (background + dirty)/2=15% threshold. For a single
> "cp", it could be soft throttled at 2*bdi->write_bandwidth around 15%

s/2/8/

Sorry, the initial soft throttle bandwidth for "cp" is about 8 times
the bdi bandwidth when reaching 15% dirty pages.

> dirty pages, and be balanced at speed bdi->write_bandwidth around 17.5%
> dirty pages. Before patch, the behavior is to just throttle it at 17.5%
> dirty pages.
> 
> Since the task will be soft throttled earlier than before, it may be
> perceived by end users as performance "slow down" if his application
> happens to dirty more than ~15% memory.


* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17  3:58 [PATCH 01/13] writeback: IO-less balance_dirty_pages() Wu Fengguang
  2010-11-17  4:19 ` Wu Fengguang
@ 2010-11-17  4:30 ` Wu Fengguang
  1 sibling, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Ts'o, Chris Mason, Dave Chinner, Jan Kara,
	Peter Zijlstra, Jens Axboe, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Christoph Hellwig, linux-mm, linux-fsdevel,
	LKML

On Wed, Nov 17, 2010 at 11:58:22AM +0800, Wu, Fengguang wrote:
> Andrew,
> References: <20101117035821.000579293@intel.com>
> Content-Disposition: inline; filename=writeback-bw-throttle.patch

Ah missed an extra empty line to quilt. Sorry, I'll re-submit.

Thanks,
Fengguang


* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17  4:19 ` Wu Fengguang
@ 2010-11-17  8:33   ` Wu Fengguang
  0 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2010-11-17  8:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Ts'o, Chris Mason, Dave Chinner, Jan Kara,
	Peter Zijlstra, Jens Axboe, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Christoph Hellwig, linux-mm, linux-fsdevel,
	LKML

On Wed, Nov 17, 2010 at 12:19:26PM +0800, Wu Fengguang wrote:
> > BEHAVIOR CHANGE
> > ===============
> > 
> > Users will notice that applications get throttled once they
> > cross the global (background + dirty)/2=15% threshold. For a single
> > "cp", it could be soft throttled at 2*bdi->write_bandwidth around 15%
> 
> s/2/8/
> 
> Sorry, the initial soft throttle bandwidth for "cp" is about 8 times
> of bdi bandwidth when reaching 15% dirty pages.

Actually it's x8 for light dirtier and x6 for heavy dirtier. There are
two control lines in the following code. The task control line is
introduced in this patch, while the bdi control line is introduced in
"[PATCH 11/13] writeback: scale down max throttle bandwidth on
concurrent dirtiers".

baseline
                bw = bdi->write_bandwidth;

bdi control line
                bw = bw * (bdi_thresh - bdi_dirty);               
                bw = bw / (bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
        
task control line
                bw = bw * (task_thresh - bdi_dirty);
                bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);

These figures demonstrate how they work together:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/heavy-dirtier-control-line.svg
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/light-dirtier-control-line.svg
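
As a rough illustration of how the two control lines above compose, here
is a minimal userspace sketch. The function name throttle_bw_sketch(),
the choice to pass the two soft-limit divisors as parameters, and the
early return of 0 at or beyond a threshold are assumptions made for this
illustration only; they are not taken from the patch.

```c
/*
 * Userspace sketch of the baseline + bdi + task control lines.
 * All names are hypothetical. The soft-limit divisors are parameters
 * because their values are not stated in this message; both must be
 * non-zero.
 */
unsigned long long throttle_bw_sketch(unsigned long long write_bw,
				      unsigned long long bdi_thresh,
				      unsigned long long task_thresh,
				      unsigned long long bdi_dirty,
				      unsigned long long bdi_soft_limit,
				      unsigned long long task_soft_limit)
{
	unsigned long long bw = write_bw;	/* baseline */

	/* Assumption: nothing is granted at or beyond either threshold. */
	if (bdi_dirty >= bdi_thresh || bdi_dirty >= task_thresh)
		return 0;

	/* bdi control line */
	bw = bw * (bdi_thresh - bdi_dirty);
	bw = bw / (bdi_thresh / bdi_soft_limit + 1);

	/* task control line */
	bw = bw * (task_thresh - bdi_dirty);
	bw = bw / (bdi_thresh / task_soft_limit + 1);

	return bw;
}
```

Each line scales bw by the remaining distance to its threshold, so the
result shrinks toward 0 as bdi_dirty approaches the lower of the two
thresholds.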

Thanks,
Fengguang

> > dirty pages, and be balanced at speed bdi->write_bandwidth around 17.5%
> > dirty pages. Before patch, the behavior is to just throttle it at 17.5%
> > dirty pages.
> > 
> > Since the task will be soft throttled earlier than before, it may be
> > perceived by end users as performance "slow down" if his application
> > happens to dirty more than ~15% memory.

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-12-06  9:52                     ` Dmitry
@ 2010-12-06 12:34                       ` Ted Ts'o
  0 siblings, 0 replies; 18+ messages in thread
From: Ted Ts'o @ 2010-12-06 12:34 UTC (permalink / raw)
  To: Dmitry
  Cc: Wu Fengguang, Peter Zijlstra, Andrew Morton, Chris Mason,
	Dave Chinner, Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Christoph Hellwig, linux-mm, linux-fsdevel,
	LKML, Tang, Feng, linux-ext4

On Mon, Dec 06, 2010 at 12:52:21PM +0300, Dmitry wrote:
> Maybe it is reasonable to introduce a new mount option which controls
> dynamic delalloc on/off behavior, for example like this:
> 0) -odelalloc=off : analog of nodelalloc
> 1) -odelalloc=normal : Default mode (disable delalloc if close to full fs)
> 2) -odelalloc=force  : delalloc mode always enabled, so we have to do
>                      writeback more aggressively in case of ENOSPC.
> 
> So one can force delalloc and can safely use this writeback mode in a
> multi-user environment. OpenVZ already has this. I'll prepare the patch
> if you are interested in that feature.

Yeah, I'd really rather not do that.  There are significant downsides
with your proposed odelalloc=force mode.  One of which is that we
could run out of space and not notice.  If the application doesn't
call fsync() and check the return value, and simply close()s the
file and then exits, when the writeback threads do get around to
writing the file, the block allocation could fail, and oops, data gets
lost.  There's a _reason_ why we disable delalloc when we're close to
a full fs.  The only alternative is to be super conservative when doing
your block reservation calculations, and in that case, you end up
returning ENOSPC far too soon.
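
For reference, the application-side pattern being described (call
fsync() and check its return value before trusting the data) could look
roughly like the sketch below. The helper name write_checked() is
hypothetical; only the POSIX calls open(), write(), fsync() and close()
are real.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write buf to path and report errors that may
 * only show up at fsync() or close() time (for example a late ENOSPC).
 * Returns 0 on success, -1 on failure with errno set.
 */
int write_checked(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int saved;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			goto fail;
		}
		p += n;
		len -= (size_t)n;
	}

	/* Ask for the data to reach the disk now, so errors are seen here. */
	if (fsync(fd) < 0)
		goto fail;

	/* close() can also report an error, so return its result. */
	return close(fd);

fail:
	saved = errno;
	close(fd);
	errno = saved;
	return -1;
}
```

An application that skips the fsync() and ignores the close() result has
no way to learn that a later block allocation failed.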

						- Ted

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-12-06  2:42                   ` Ted Ts'o
@ 2010-12-06  9:52                     ` Dmitry
  2010-12-06 12:34                       ` Ted Ts'o
  0 siblings, 1 reply; 18+ messages in thread
From: Dmitry @ 2010-12-06  9:52 UTC (permalink / raw)
  To: Ted Ts'o, Wu Fengguang
  Cc: Peter Zijlstra, Andrew Morton, Chris Mason, Dave Chinner,
	Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel, LKML, Tang, Feng,
	linux-ext4

On Sun, 5 Dec 2010 21:42:31 -0500, Ted Ts'o <tytso@mit.edu> wrote:
> On Mon, Dec 06, 2010 at 12:14:35AM +0800, Wu Fengguang wrote:
> > 
> > Ah I seem to find the root cause. See the attached graphs. Ext4 should
> > be calling redirty_page_for_writepage() to redirty ~300MB pages on
> > every ~10s. The redirties happen in big bursts, so not surprisingly
> > the dd task's dirty weight will suddenly drop to 0.
> > 
> > It should be the same ext4 issue discussed here:
> > 
> >         http://www.spinics.net/lists/linux-fsdevel/msg39555.html
> 
> Yeah, unfortunately the fix suggested isn't the right one.
> 
> The right fix is going to involve making much more radical changes to
> the ext4 write submission path, which is on my todo queue.  For now,
> if people don't like these nasty writeback dynamics, my suggestion for
> now is to mount the filesystem data=writeback.
> 
> This is basically the clean equivalent of the patch suggested by Feng
> Tang in his e-mail referenced above.  Given that ext4 uses delayed
> allocation, most of the time unwritten blocks are not allocated, and
> so stale data isn't exposed.
Maybe it is reasonable to introduce a new mount option which controls
dynamic delalloc on/off behavior, for example like this:
0) -odelalloc=off : analog of nodelalloc
1) -odelalloc=normal : Default mode (disable delalloc if close to full fs)
2) -odelalloc=force  : delalloc mode always enabled, so we have to do
                     writeback more aggressively in case of ENOSPC.

So one can force delalloc and can safely use this writeback mode in a
multi-user environment. OpenVZ already has this. I'll prepare the patch
if you are interested in that feature.
> 
> The case which you're seeing here is where the jbd2 data=ordered
> forced writeback is colliding with the writeback thread, and
> unfortunately, the forced writeback in the jbd2 layer is done in an
> extremely inefficient manner.  So data=writeback is the workaround,
> and unlike ext3, it's not a serious security leak.  It is possible for
> some stale data to get exposed if you get unlucky when you crash,
> though, so there is a potential for some security exposure.
> 
> The long-term solution to this problem is to rework the ext4 writeback
> path so that we write the data blocks when they are newly allocated,
> and then only update fs metadata once they are written.  As I said,
> it's on my queue.  Until then, the only suggestion I can give folks is
> data=writeback.
> 
> 						- Ted
> 

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-12-05 16:14                 ` Wu Fengguang
@ 2010-12-06  2:42                   ` Ted Ts'o
  2010-12-06  9:52                     ` Dmitry
  0 siblings, 1 reply; 18+ messages in thread
From: Ted Ts'o @ 2010-12-06  2:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, Andrew Morton, Chris Mason, Dave Chinner,
	Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel, LKML, Tang, Feng,
	linux-ext4

On Mon, Dec 06, 2010 at 12:14:35AM +0800, Wu Fengguang wrote:
> 
> Ah I seem to find the root cause. See the attached graphs. Ext4 should
> be calling redirty_page_for_writepage() to redirty ~300MB pages on
> every ~10s. The redirties happen in big bursts, so not surprisingly
> the dd task's dirty weight will suddenly drop to 0.
> 
> It should be the same ext4 issue discussed here:
> 
>         http://www.spinics.net/lists/linux-fsdevel/msg39555.html

Yeah, unfortunately the fix suggested isn't the right one.

The right fix is going to involve making much more radical changes to
the ext4 write submission path, which is on my todo queue.  For now,
if people don't like these nasty writeback dynamics, my suggestion for
now is to mount the filesystem data=writeback.

This is basically the clean equivalent of the patch suggested by Feng
Tang in his e-mail referenced above.  Given that ext4 uses delayed
allocation, most of the time unwritten blocks are not allocated, and
so stale data isn't exposed.

The case which you're seeing here is where the jbd2 data=ordered
forced writeback is colliding with the writeback thread, and
unfortunately, the forced writeback in the jbd2 layer is done in an
extremely inefficient manner.  So data=writeback is the workaround,
and unlike ext3, it's not a serious security leak.  It is possible for
some stale data to get exposed if you get unlucky when you crash,
though, so there is a potential for some security exposure.

The long-term solution to this problem is to rework the ext4 writeback
path so that we write the data blocks when they are newly allocated,
and then only update fs metadata once they are written.  As I said,
it's on my queue.  Until then, the only suggestion I can give folks is
data=writeback.

						- Ted

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
       [not found]               ` <20101201133818.GA13377@localhost>
  2010-12-01 23:03                 ` Andrew Morton
@ 2010-12-05 16:14                 ` Wu Fengguang
  2010-12-06  2:42                   ` Ted Ts'o
  1 sibling, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2010-12-05 16:14 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton
  Cc: Theodore Ts'o, Chris Mason, Dave Chinner, Jan Kara,
	Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel, LKML, Tang, Feng,
	linux-ext4

[-- Attachment #1: Type: text/plain, Size: 3733 bytes --]

On Wed, Dec 01, 2010 at 09:38:18PM +0800, Wu Fengguang wrote:
> [restore CC list for new findings]
> 
> On Wed, Dec 01, 2010 at 06:39:25AM +0800, Peter Zijlstra wrote:
> > On Tue, 2010-11-30 at 23:35 +0100, Peter Zijlstra wrote:
> > > On Tue, 2010-11-30 at 12:37 +0800, Wu Fengguang wrote:
> > > > On Tue, Nov 30, 2010 at 04:53:33AM +0800, Peter Zijlstra wrote:
> > > > > On Mon, 2010-11-29 at 23:17 +0800, Wu Fengguang wrote:
> > > > > > Hi Peter,
> > > > > >
> > > > > > I'm drawing funny graphs to track the writeback dynamics :)
> > > > > >
> > > > > > In the attached graphs, I find abnormals in dirty-pages-3000.png and
> > > > > > dirty-pages-200.png.  The task limit is what's returned by
> > > > > > task_dirty_limit(), which should be very stable. However from the
> > > > > > graph it seems the task weight (numerator/denominator) will suddenly
> > > > > > drop to near 0 on every 9-10 seconds.  Do you have immediate insight
> > > > > > on what's going on? If not, I'm going to do some tracing to track down
> > > > > > how the numbers change over time.
> > > > >
> > > > > No immediate thoughts there.. I need to look through the math again, but
> > > > > I'm kinda swamped atm. (and my primary dev machine had its disk die this
> > > > > morning). I'll try and get around to it soon..
> > > >
> > > > Peter, I did a simple debug patch (attached) and collected these
> > > > numbers. I noticed that at the "task_weight=27%" and "task_weight=14%"
> > > > lines, "period" increases, "num" is decreased while "den" is still
> > > > increasing.
> > > >
> > > > num=db2e den=e8c0 period=3f8000 shift=10
> > > > num=e04c den=ede0 period=3f8000 shift=10
> > > > num=e56a den=f300 period=3f8000 shift=10
> > >
> > > > num=3e78 den=e400 period=408000 shift=10
> > >
> > > > num=1341 den=8900 period=418000 shift=10
> > > > num=185f den=8e20 period=418000 shift=10
> > > > num=1d7d den=9340 period=418000 shift=10
> > > > num=229b den=9860 period=418000 shift=10
> > > > num=27b9 den=9da0 period=418000 shift=10
> > > > num=2cd7 den=a2c0 period=418000 shift=10
> > >
> > >
> > > This looks sane.. the period indicates someone else was dirtying lots of
> > > pages. Every time the period increases (its shifted right by shift) we
> > > divide the events (num) by 2.
> >
> > Its actually shifted left by shift-1.. see prop_norm_single(), which
> > would make the below:
> >
> > > So the increment from 3f8000 to 408000 is 4064 to 4128, or 64, that
> > > should reset events to 0, seeing that it didn't means it got incremented
> > > as well.
> > >
> > > Funny enough, the second jump is again exactly 64..
> > >
> > > Anyway, as you can see, den increases as long as period stays constant,
> > > it takes a dip when period increments.
> >
> > two steps of 128, which is terribly large.
> >
> > then again, a period of 512 pages is very very small.
> 
> Peter, I also collected prop_norm_single() traces, hope it helps.
> 
> Again, you can find time points when the task limit suddenly skip high
> in graphs "dirty-pages*.png", and then find the corresponding data
> point in file "trace". Sorry I compute something wrong: the "ratio"
> field in the trace data is always 0, please just ignore them.
> 
> I noticed that jbd2/sda8-8-2811 dirtied lots of pages, perhaps by
> ext4_bio_write_page(). This should happen only on -ENOMEM.  I also

Ah, I seem to have found the root cause. See the attached graphs. Ext4
is presumably calling redirty_page_for_writepage() to redirty ~300MB of
pages every ~10s. The redirties happen in big bursts, so not
surprisingly the dd task's dirty weight suddenly drops to 0.
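
The numbers in the quoted debug output can be re-derived with a small
userspace model of the arithmetic Peter describes (the period is counted
in units of 1 << (shift - 1) events, and the local event count is halved
once per missed period). This is a simplified sketch, not the kernel's
prop_norm_single(); the three function names are hypothetical.

```c
/* Task weight in percent, from a num/den pair in the debug output. */
unsigned long weight_percent(unsigned long num, unsigned long den)
{
	return den ? num * 100 / den : 0;
}

/*
 * Number of periods elapsed between two "period" readings, using the
 * shift - 1 granularity mentioned for prop_norm_single().
 */
unsigned long missed_periods(unsigned long old_period,
			     unsigned long new_period, int shift)
{
	return (new_period - old_period) >> (shift - 1);
}

/*
 * Halve the event count once per missed period. Shifting by the full
 * width of the type is undefined in C, so clear the counter instead.
 */
unsigned long decay_events(unsigned long events, unsigned long periods)
{
	if (periods >= sizeof(events) * 8)
		return 0;
	return events >> periods;
}
```

With the quoted values, num=3e78 den=e400 gives a weight of 27% and
num=1341 den=8900 gives 14%. The step from period=3f8000 to 408000 at
shift=10 is 128 missed periods, which in this model decays any previous
event count to 0; the non-zero num seen in the trace is then consistent
with Peter's remark that it got incremented as well.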

It should be the same ext4 issue discussed here:

        http://www.spinics.net/lists/linux-fsdevel/msg39555.html

Thanks,
Fengguang

[-- Attachment #2: vmstat-written-300.png --]
[-- Type: image/png, Size: 44152 bytes --]

[-- Attachment #3: vmstat-written.png --]
[-- Type: image/png, Size: 40715 bytes --]

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-12-01 23:03                 ` Andrew Morton
@ 2010-12-02  1:56                   ` Wu Fengguang
  0 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2010-12-02  1:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Theodore Ts'o, Chris Mason, Dave Chinner,
	Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel, LKML

On Thu, Dec 02, 2010 at 07:03:33AM +0800, Andrew Morton wrote:
> On Wed, 1 Dec 2010 21:38:18 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > It shows that
> > 
> > 1) io_schedule_timeout(200ms) always return immediately for iostat,
> >    forming a busy loop.  How can this happen? When iostat received
> >    some signal? Then we may have to break out of the loop on catching
> >    signals. Note that I already have
> >                 if (fatal_signal_pending(current))
> >                         break;
> >    in the balance_dirty_pages() loop. Obviously that's not enough.
> 
> Presumably the calling task has signal_pending().
> 
> Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong.  If it's
> going to do that then it must break out if signal_pending(), otherwise
> it's pretty much guaranteed to degenerate into a busywait loop.

Right. Checking signal_pending() does not seem rewarding enough.  We
can already respond to signals much faster than before (when the task
spent more time blocked in get_request_wait()).

> Plus we *do* want these processes to appear in D state and to
> contribute to load average.
> 
> So it should be TASK_UNINTERRUPTIBLE.

Fair enough. I did miss the D state (without the long wait :).
Here is the patch.

Thanks,
Fengguang
---
Subject: writeback: do uninterruptible sleep in balance_dirty_pages()
Date: Thu Dec 02 09:31:19 CST 2010

Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong.  If it's
going to do that then it must break out if signal_pending(), otherwise
it's pretty much guaranteed to degenerate into a busywait loop.  Plus
we *do* want these processes to appear in D state and to contribute to
load average.

So it should be TASK_UNINTERRUPTIBLE.                 -- Andrew Morton

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-02 09:30:29.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-02 09:30:34.000000000 +0800
@@ -636,7 +636,7 @@ pause:
 					  pages_dirtied,
 					  pause);
 		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
-		__set_current_state(TASK_INTERRUPTIBLE);
+		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
 		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
       [not found]               ` <20101201133818.GA13377@localhost>
@ 2010-12-01 23:03                 ` Andrew Morton
  2010-12-02  1:56                   ` Wu Fengguang
  2010-12-05 16:14                 ` Wu Fengguang
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2010-12-01 23:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, Theodore Ts'o, Chris Mason, Dave Chinner,
	Jan Kara, Jens Axboe, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
	Christoph Hellwig, linux-mm, linux-fsdevel, LKML

On Wed, 1 Dec 2010 21:38:18 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> It shows that
> 
> 1) io_schedule_timeout(200ms) always return immediately for iostat,
>    forming a busy loop.  How can this happen? When iostat received
>    some signal? Then we may have to break out of the loop on catching
>    signals. Note that I already have
>                 if (fatal_signal_pending(current))
>                         break;
>    in the balance_dirty_pages() loop. Obviously that's not enough.

Presumably the calling task has signal_pending().

Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong.  If it's
going to do that then it must break out if signal_pending(), otherwise
it's pretty much guaranteed to degenerate into a busywait loop.  Plus
we *do* want these processes to appear in D state and to contribute to
load average.

So it should be TASK_UNINTERRUPTIBLE.
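
The busywait argument can be shown with a toy userspace model. It only
assumes that an interruptible timed sleep returns at once while a signal
is pending, and that the throttle loop never checks for non-fatal
signals. The enum, sleep_model() and throttle_loop_iterations() are
hypothetical names; nothing here is kernel code.

```c
enum sleep_state { SLEEP_INTERRUPTIBLE, SLEEP_UNINTERRUPTIBLE };

/*
 * Toy stand-in for a timed sleep: an interruptible sleep returns at
 * once while a signal is pending, otherwise the full timeout elapses.
 * Returns the time actually slept.
 */
static long sleep_model(enum sleep_state state, int signal_pending,
			long timeout)
{
	if (state == SLEEP_INTERRUPTIBLE && signal_pending)
		return 0;
	return timeout;
}

/*
 * Count the loop iterations needed to accumulate total_pause of
 * throttling. max_iters bounds the model so that it always terminates.
 */
long throttle_loop_iterations(enum sleep_state state, int signal_pending,
			      long pause, long total_pause, long max_iters)
{
	long slept = 0, iters = 0;

	while (slept < total_pause && iters < max_iters) {
		slept += sleep_model(state, signal_pending, pause);
		iters++;
	}
	return iters;
}
```

With a pending signal, the interruptible variant never accumulates any
sleep time and spins until the iteration cap, while the uninterruptible
variant finishes a 1000-unit throttle in five 200-unit pauses.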

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17 10:34   ` Minchan Kim
@ 2010-11-22  2:01     ` Wu Fengguang
  0 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2010-11-22  2:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner,
	Peter Zijlstra, Jens Axboe, Christoph Hellwig, Theodore Ts'o,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel, LKML

Hi Minchan,

On Wed, Nov 17, 2010 at 06:34:26PM +0800, Minchan Kim wrote:
> Hi Wu,
> 
> As you know, I am not a expert in this area.
> So I hope my review can help understanding other newbie like me and
> make clear this document. :)
> I didn't look into the code. before it, I would like to clear your concept.

Yeah, it's some big change of "concept" :)

Sorry for the late reply, as I'm still tuning things and some details
may change as a result. The biggest challenge now is the stability of
the control algorithms. Everything is floating around and I'm trying
to keep the fluctuations down by borrowing some equations from
optimal control theory.

> On Wed, Nov 17, 2010 at 1:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > As proposed by Chris, Dave and Jan, don't start foreground writeback IO
> > inside balance_dirty_pages(). Instead, simply let it idle sleep for some
> > time to throttle the dirtying task. In the mean while, kick off the
> > per-bdi flusher thread to do background writeback IO.
> >
> > This patch introduces the basic framework, which will be further
> > consolidated by the next patches.
> >
> > RATIONALS
> > =========
> >
> > The current balance_dirty_pages() is rather IO inefficient.
> >
> > - concurrent writeback of multiple inodes (Dave Chinner)
> >
> >  If every thread doing writes and being throttled start foreground
> >  writeback, it leads to N IO submitters from at least N different
> >  inodes at the same time, end up with N different sets of IO being
> >  issued with potentially zero locality to each other, resulting in
> >  much lower elevator sort/merge efficiency and hence we seek the disk
> >  all over the place to service the different sets of IO.
> >  OTOH, if there is only one submission thread, it doesn't jump between
> >  inodes in the same way when congestion clears - it keeps writing to
> >  the same inode, resulting in large related chunks of sequential IOs
> >  being issued to the disk. This is more efficient than the above
> >  foreground writeback because the elevator works better and the disk
> >  seeks less.
> >
> > - IO size too small for fast arrays and too large for slow USB sticks
> >
> >  The write_chunk used by current balance_dirty_pages() cannot be
> >  directly set to some large value (eg. 128MB) for better IO efficiency.
> >  Because it could lead to more than 1 second user perceivable stalls.
> >  Even the current 4MB write size may be too large for slow USB sticks.
> >  The fact that balance_dirty_pages() starts IO on itself couples the
> >  IO size to wait time, which makes it hard to do suitable IO size while
> >  keeping the wait time under control.
> >
> > For the above two reasons, it's much better to shift IO to the flusher
> > threads and let balance_dirty_pages() just wait for enough time or progress.
> >
> > Jan Kara, Dave Chinner and me explored the scheme to let
> > balance_dirty_pages() wait for enough writeback IO completions to
> > safeguard the dirty limit. However it's found to have two problems:
> >
> > - in large NUMA systems, the per-cpu counters may have big accounting
> >  errors, leading to big throttle wait time and jitters.
> >
> > - NFS may kill large amount of unstable pages with one single COMMIT.
> >  Because NFS server serves COMMIT with expensive fsync() IOs, it is
> >  desirable to delay and reduce the number of COMMITs. So it's not
> >  likely to optimize away such kind of bursty IO completions, and the
> >  resulted large (and tiny) stall times in IO completion based throttling.
> >
> > So here is a pause time oriented approach, which tries to control the
> > pause time in each balance_dirty_pages() invocations, by controlling
> > the number of pages dirtied before calling balance_dirty_pages(), for
> > smooth and efficient dirty throttling:
> >
> > - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> > - avoid too small pause time (less than  10ms, which burns CPU power)
> > - avoid too large pause time (more than 100ms, which hurts responsiveness)
> > - avoid big fluctuations of pause times
> >
> > For example, when doing a simple cp on ext4 with mem=4G HZ=250.
> >
> > before patch, the pause time fluctuates from 0 to 324ms
> > (and the stall time may grow very large for slow devices)
> >
> > [ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
> > [ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> > [ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> > [ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> > [ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> > [ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> > [ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> > [ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> > [ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> > [ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
> > [ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
> > [ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
> > [ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> > [ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> > [ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> >
> > after patch, the pause time remains stable around 32ms
> >
> > cp-2687  [002]  1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
> > cp-2687  [002]  1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
> > cp-2687  [006]  1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
> > cp-2687  [006]  1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
> > cp-2687  [006]  1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
> > cp-2687  [002]  1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
> > cp-2687  [002]  1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8
> >
> > CONTROL SYSTEM
> > ==============
> >
> > The current task_dirty_limit() adjusts bdi_dirty_limit to get
> > task_dirty_limit according to the dirty "weight" of the current task,
> > which is the percent of pages recently dirtied by the task. If 100%
> > pages are recently dirtied by the task, it will lower bdi_dirty_limit by
> > 1/8. If only 1% pages are dirtied by the task, it will return almost
> > unmodified bdi_dirty_limit. In this way, a heavy dirtier will get
> > blocked at task_dirty_limit=(bdi_dirty_limit-bdi_dirty_limit/8) while
> > allowing a light dirtier to progress (the latter won't be blocked
> > because R << B in fig.1).
> >
> > Fig.1 before patch, a heavy dirtier and a light dirtier
> >                                                R
> > ----------------------------------------------+-o---------------------------*--|
> >                                              L A                           B  T
> >  T: bdi_dirty_limit, as returned by bdi_dirty_limit()
> >  L: T - T/8
> >
> >  R: bdi_reclaimable + bdi_writeback
> >
> >  A: task_dirty_limit for a heavy dirtier ~= R ~= L
> >  B: task_dirty_limit for a light dirtier ~= T
> >
> > Since each process has its own dirty limit, we reuse A/B for the tasks as
> > well as their dirty limits.
> >
> > If B is a newly started heavy dirtier, then it will slowly gain weight
> > and A will lose weight.  The task_dirty_limit for A and B will be
> > approaching the center of region (L, T) and eventually stabilize there.
> >
> > Fig.2 before patch, two heavy dirtiers converging to the same threshold
> >                                                             R
> > ----------------------------------------------+--------------o-*---------------|
> >                                              L              A B               T
> 
> Seems good until now.
> So, What's the problem if two heavy dirtiers have a same threshold?

That's not a problem. It's the proper behavior to converge for two
"dd"s.

> > Fig.3 after patch, one heavy dirtier
> >                                                |
> >    throttle_bandwidth ~= bdi_bandwidth  =>     o
> >                                                | o
> >                                                |   o
> >                                                |     o
> >                                                |       o
> >                                                |         o
> >                                              La|           o
> > ----------------------------------------------+-+-------------o----------------|
> >                                                R             A                T
> >  T: bdi_dirty_limit
> >  A: task_dirty_limit      = T - Wa * T/16
> >  La: task_throttle_thresh = A - A/16
> >
> >  R: bdi_dirty_pages = bdi_reclaimable + bdi_writeback ~= La
> >
> > Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
> > way. In fig.3, a soft dirty limit region (La, A) is introduced. When R enters
> > this region, the task may be throttled for J jiffies on every N pages it dirtied.
> > Let's call (N/J) the "throttle bandwidth". It is computed by the following formula:
> >
> >        throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
> > where
> >        A = T - Wa * T/16
> >        La = A - A/16
> > where Wa is task weight for A. It's 0 for very light dirtier and 1 for
> > the one heavy dirtier (that consumes 100% bdi write bandwidth).  The
> > task weight will be updated independently by task_dirty_inc() at
> > set_page_dirty() time.
> 
> 
> Dumb question.
> 
> I can't see the difference between old and new,
> La depends on A.
> A depends on Wa.
> T is constant?

T is the bdi's share of the global dirty limit. It's normally stable,
and here we use it as the reference point for per-bdi dirty throttling.

> Then, throttle_bandwidth depends on Wa.

Sure, each task will be throttled at a different bandwidth if their
"Wa" values are different.

> Wa depends on the number of dirtied pages during some interval.
> So if light dirtier become heavy, at last light dirtier and heavy
> dirtier will have a same weight.
> It means throttle_bandwidth is same. It's a same with old result.

Yeah. Wa and throttle_bandwidth are changing over time.
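
To make the old/new comparison concrete, here is a minimal userspace
sketch of both rules as stated in the quoted description: the old
per-task limit (bdi_dirty_limit lowered by up to 1/8 according to the
task weight) and the new throttle bandwidth formula with
A = T - Wa * T/16 and La = A - A/16. The function names, the use of
double, and returning 0 once R reaches A are assumptions for this
illustration, not the patch's code.

```c
/*
 * Old behaviour (sketch): the task limit is bdi_dirty_limit lowered by
 * up to 1/8, in proportion to the task's dirty weight num/den.
 */
unsigned long task_dirty_limit_sketch(unsigned long bdi_dirty_limit,
				      unsigned long num, unsigned long den)
{
	unsigned long long inv = bdi_dirty_limit / 8;

	if (den == 0)
		return bdi_dirty_limit;
	inv = inv * num / den;
	return bdi_dirty_limit - (unsigned long)inv;
}

/*
 * New behaviour (sketch):
 *   throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
 * where A = T - Wa * T/16, La = A - A/16 and Wa is in [0, 1].
 * T must be positive so that A - La is non-zero.
 */
double throttle_bandwidth_sketch(double bdi_bandwidth, double T,
				 double Wa, double R)
{
	double A = T - Wa * T / 16.0;
	double La = A - A / 16.0;

	/* Assumption: no bandwidth is granted once R reaches A. */
	if (R >= A)
		return 0.0;
	return bdi_bandwidth * (A - R) / (A - La);
}
```

For example, with T = 1600 and bdi_bandwidth = 100, a heavy dirtier
(Wa = 1) has A = 1500 and La = 1406.25, so it gets 100 at R = La and 50
halfway between La and A. At that same R, a very light dirtier (Wa = 0)
still gets about 147, which is how different weights lead to different
throttle bandwidths.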
 
> Please, open my eyes. :)

You get the dynamics right :)

> Thanks for the great work.

Thanks,
Fengguang

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-18 13:40       ` Peter Zijlstra
@ 2010-11-18 14:02         ` Wu Fengguang
  0 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2010-11-18 14:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Theodore Ts'o, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML, tglx

On Thu, Nov 18, 2010 at 09:40:06PM +0800, Peter Zijlstra wrote:
> On Thu, 2010-11-18 at 21:26 +0800, Wu Fengguang wrote:
> > On Thu, Nov 18, 2010 at 09:04:34PM +0800, Peter Zijlstra wrote:
> > > On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > > > - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> > > > - avoid too small pause time (less than  10ms, which burns CPU power)
> > > > - avoid too large pause time (more than 100ms, which hurts responsiveness)
> > > > - avoid big fluctuations of pause times 
> > > 
> > > If you feel like playing with sub-jiffies timeouts (a way to avoid that
> > > HZ=>100 assumption), the below (totally untested) patch might be of
> > > help..
> > 
> > Assuming there are HZ=10 users.
> > 
> > - when choosing such a coarse granularity, do they really care about
> >   responsiveness? :)
> 
> No of course not, they usually care about booting their system,.. I've
> been told booting Linux on a 10MHz FPGA is 'fun' :-)

Wow, it's amazing Linux can run on it at all :)

> > - will the use of hrtimer add a little code size and/or runtime
> >   overheads, and hence hurt the majority HZ=100 users?
> 
> Yes it will add code and runtime overhead, but it would allow you to
> have 1ms timeouts even on a HZ=100 system, as opposed to a 10ms minimum.

Yeah, Dave Chinner once pointed out 1ms sleep may be desirable on
really fast storage. That may help if there is only one really fast
dirtier. Let's see if there will come such user demands.

But for now, amusingly, the demand is to have 100-200ms pause time for
reducing CPU overheads when there are hundreds of concurrent dirtiers.
The number itself is pretty easy to tune, but the downside I find is
much bigger fluctuations. So I'm now trying ways to keep them under
control..

> Anyway, I'm not saying you should do it, I just wondered if we had the
> API, saw we didn't and thought it might be nice to offer it if desired.

Thanks for the offer. We can surely do it once some loud user
complaint comes up :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-18 13:26     ` Wu Fengguang
@ 2010-11-18 13:40       ` Peter Zijlstra
  2010-11-18 14:02         ` Wu Fengguang
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2010-11-18 13:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Theodore Ts'o, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML, tglx

On Thu, 2010-11-18 at 21:26 +0800, Wu Fengguang wrote:
> On Thu, Nov 18, 2010 at 09:04:34PM +0800, Peter Zijlstra wrote:
> > On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > > - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> > > - avoid too small pause time (less than  10ms, which burns CPU power)
> > > - avoid too large pause time (more than 100ms, which hurts responsiveness)
> > > - avoid big fluctuations of pause times 
> > 
> > If you feel like playing with sub-jiffies timeouts (a way to avoid that
> > HZ=>100 assumption), the below (totally untested) patch might be of
> > help..
> 
> Assuming there are HZ=10 users.
> 
> - when choosing such a coarse granularity, do they really care about
>   responsiveness? :)

No of course not, they usually care about booting their system.. I've
been told booting Linux on a 10MHz FPGA is 'fun' :-)

> - will the use of hrtimer add a little code size and/or runtime
>   overheads, and hence hurt the majority HZ=100 users?

Yes it will add code and runtime overhead, but it would allow you to
have 1ms timeouts even on a HZ=100 system, as opposed to a 10ms minimum.

Anyway, I'm not saying you should do it, I just wondered if we had the
API, saw we didn't and thought it might be nice to offer it if desired.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-18 13:04   ` Peter Zijlstra
@ 2010-11-18 13:26     ` Wu Fengguang
  2010-11-18 13:40       ` Peter Zijlstra
       [not found]     ` <20101129151719.GA30590@localhost>
  1 sibling, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2010-11-18 13:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Theodore Ts'o, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML, tglx

On Thu, Nov 18, 2010 at 09:04:34PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> > - avoid too small pause time (less than  10ms, which burns CPU power)
> > - avoid too large pause time (more than 100ms, which hurts responsiveness)
> > - avoid big fluctuations of pause times 
> 
> If you feel like playing with sub-jiffies timeouts (a way to avoid that
> HZ=>100 assumption), the below (totally untested) patch might be of
> help..

Assuming there are HZ=10 users.

- when choosing such a coarse granularity, do they really care about
  responsiveness? :)

- will the use of hrtimer add a little code size and/or runtime
  overheads, and hence hurt the majority HZ=100 users?

Thanks,
Fengguang

> 
> ---
> Subject: hrtimer: Provide io_schedule_timeout*() functions
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/hrtimer.h |    7 +++++++
>  kernel/hrtimer.c        |   15 +++++++++++++++
>  kernel/sched.c          |   17 +++++++++++++++++
>  3 files changed, 39 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
> index dd9954b..9e0f67e 100644
> --- a/include/linux/hrtimer.h
> +++ b/include/linux/hrtimer.h
> @@ -419,6 +419,13 @@ extern long hrtimer_nanosleep_restart(struct restart_block *restart_block);
>  extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
>  				 struct task_struct *tsk);
>  
> +extern int io_schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
> +						const enum hrtimer_mode mode);
> +extern int io_schedule_hrtimeout_range_clock(ktime_t *expires,
> +		unsigned long delta, const enum hrtimer_mode mode, int clock);
> +extern int io_schedule_hrtimeout(ktime_t *expires, const enum hrtimer_mode mode);
> +
> +
>  extern int schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
>  						const enum hrtimer_mode mode);
>  extern int schedule_hrtimeout_range_clock(ktime_t *expires,
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index 72206cf..ef2d93c 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -1838,6 +1838,14 @@ int __sched schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
>  }
>  EXPORT_SYMBOL_GPL(schedule_hrtimeout_range);
>  
> +int __sched io_schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
> +				     const enum hrtimer_mode mode)
> +{
> +	return io_schedule_hrtimeout_range_clock(expires, delta, mode,
> +					      CLOCK_MONOTONIC);
> +}
> +EXPORT_SYMBOL_GPL(io_schedule_hrtimeout_range);
> +
>  /**
>   * schedule_hrtimeout - sleep until timeout
>   * @expires:	timeout value (ktime_t)
> @@ -1866,3 +1874,10 @@ int __sched schedule_hrtimeout(ktime_t *expires,
>  	return schedule_hrtimeout_range(expires, 0, mode);
>  }
>  EXPORT_SYMBOL_GPL(schedule_hrtimeout);
> +
> +int __sched io_schedule_hrtimeout(ktime_t *expires,
> +			       const enum hrtimer_mode mode)
> +{
> +	return io_schedule_hrtimeout_range(expires, 0, mode);
> +}
> +EXPORT_SYMBOL_GPL(io_schedule_hrtimeout);
> diff --git a/kernel/sched.c b/kernel/sched.c
> index d5564a8..ac84455 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5303,6 +5303,23 @@ long __sched io_schedule_timeout(long timeout)
>  	return ret;
>  }
>  
> +int __sched
> +io_schedule_hrtimeout_range_clock(ktime_t *expires, unsigned long delta,
> +			       const enum hrtimer_mode mode, int clock)
> +{
> +	struct rq *rq = raw_rq();
> +	long ret;
> +
> +	delayacct_blkio_start();
> +	atomic_inc(&rq->nr_iowait);
> +	current->in_iowait = 1;
> +	ret = schedule_hrtimeout_range_clock(expires, delta, mode, clock);
> +	current->in_iowait = 0;
> +	atomic_dec(&rq->nr_iowait);
> +	delayacct_blkio_end();
> +	return ret;
> +}
> +
>  /**
>   * sys_sched_get_priority_max - return maximum RT priority.
>   * @policy: scheduling class.
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17  4:27 ` [PATCH 01/13] writeback: IO-less balance_dirty_pages() Wu Fengguang
  2010-11-17 10:34   ` Minchan Kim
  2010-11-17 23:08   ` Andrew Morton
@ 2010-11-18 13:04   ` Peter Zijlstra
  2010-11-18 13:26     ` Wu Fengguang
       [not found]     ` <20101129151719.GA30590@localhost>
  2 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2010-11-18 13:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Theodore Ts'o, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML, tglx

On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> - avoid too small pause time (less than  10ms, which burns CPU power)
> - avoid too large pause time (more than 100ms, which hurts responsiveness)
> - avoid big fluctuations of pause times 

If you feel like playing with sub-jiffies timeouts (a way to avoid that
HZ>=100 assumption), the below (totally untested) patch might be of
help..


---
Subject: hrtimer: Provide io_schedule_timeout*() functions

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/hrtimer.h |    7 +++++++
 kernel/hrtimer.c        |   15 +++++++++++++++
 kernel/sched.c          |   17 +++++++++++++++++
 3 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index dd9954b..9e0f67e 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -419,6 +419,13 @@ extern long hrtimer_nanosleep_restart(struct restart_block *restart_block);
 extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
 				 struct task_struct *tsk);
 
+extern int io_schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
+						const enum hrtimer_mode mode);
+extern int io_schedule_hrtimeout_range_clock(ktime_t *expires,
+		unsigned long delta, const enum hrtimer_mode mode, int clock);
+extern int io_schedule_hrtimeout(ktime_t *expires, const enum hrtimer_mode mode);
+
+
 extern int schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
 						const enum hrtimer_mode mode);
 extern int schedule_hrtimeout_range_clock(ktime_t *expires,
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 72206cf..ef2d93c 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1838,6 +1838,14 @@ int __sched schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
 }
 EXPORT_SYMBOL_GPL(schedule_hrtimeout_range);
 
+int __sched io_schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
+				     const enum hrtimer_mode mode)
+{
+	return io_schedule_hrtimeout_range_clock(expires, delta, mode,
+					      CLOCK_MONOTONIC);
+}
+EXPORT_SYMBOL_GPL(io_schedule_hrtimeout_range);
+
 /**
  * schedule_hrtimeout - sleep until timeout
  * @expires:	timeout value (ktime_t)
@@ -1866,3 +1874,10 @@ int __sched schedule_hrtimeout(ktime_t *expires,
 	return schedule_hrtimeout_range(expires, 0, mode);
 }
 EXPORT_SYMBOL_GPL(schedule_hrtimeout);
+
+int __sched io_schedule_hrtimeout(ktime_t *expires,
+			       const enum hrtimer_mode mode)
+{
+	return io_schedule_hrtimeout_range(expires, 0, mode);
+}
+EXPORT_SYMBOL_GPL(io_schedule_hrtimeout);
diff --git a/kernel/sched.c b/kernel/sched.c
index d5564a8..ac84455 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5303,6 +5303,23 @@ long __sched io_schedule_timeout(long timeout)
 	return ret;
 }
 
+int __sched
+io_schedule_hrtimeout_range_clock(ktime_t *expires, unsigned long delta,
+			       const enum hrtimer_mode mode, int clock)
+{
+	struct rq *rq = raw_rq();
+	long ret;
+
+	delayacct_blkio_start();
+	atomic_inc(&rq->nr_iowait);
+	current->in_iowait = 1;
+	ret = schedule_hrtimeout_range_clock(expires, delta, mode, clock);
+	current->in_iowait = 0;
+	atomic_dec(&rq->nr_iowait);
+	delayacct_blkio_end();
+	return ret;
+}
+
 /**
  * sys_sched_get_priority_max - return maximum RT priority.
  * @policy: scheduling class.


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17  4:27 ` [PATCH 01/13] writeback: IO-less balance_dirty_pages() Wu Fengguang
  2010-11-17 10:34   ` Minchan Kim
@ 2010-11-17 23:08   ` Andrew Morton
  2010-11-18 13:04   ` Peter Zijlstra
  2 siblings, 0 replies; 18+ messages in thread
From: Andrew Morton @ 2010-11-17 23:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Chris Mason, Dave Chinner, Peter Zijlstra, Jens Axboe,
	Christoph Hellwig, Theodore Ts'o, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 17 Nov 2010 12:27:21 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Since the task will be soft throttled earlier than before, it may be
> perceived by end users as performance "slow down" if his application
> happens to dirty more than ~15% memory.

writeback has always had these semi-bogus assumptions that all pages
are the same, and it can sometimes go very wrong.

A chronic case would be a 4GB i386 machine where only 1/4 of memory is
useable for GFP_KERNEL allocations, filesystem metadata and /dev/sdX
pagecache.

When you think about it, a lot of the throttling work being done in
writeback is really being done on behalf of the page allocator (and
hence page reclaim).  But what happens if the workload is mainly
hammering away at ZONE_NORMAL, but writeback is considering ZONE_NORMAL
to be the same thing as ZONE_HIGHMEM?

Or vice versa, where page-dirtyings are all happening in lowmem?  Can
writeback then think that there are plenty of clean pages (because it's
looking at highmem as well) so little or no throttling is happening? 
If so, what effect does this have upon GFP_KERNEL/GFP_USER allocation?

And bear in mind that the user can tune the dirty levels.  If they're
set to 10% on a machine on which 25% of memory is lowmem then ill
effects might be rare.  But if the user tweaks the thresholds to 30%
then can we get into problems?  Such as a situation where 100% of
lowmem is dirty and throttling isn't cutting in?



So please have a think about that and see if you can think of ways in
which this assumption can cause things to go bad.  I'd suggest
writing some targeted tests which write to /dev/sdX (to generate
lowmem-only dirty pages) and which read from /dev/sdX (to request
allocation of lowmem pages).  Run these tests in conjunction with tests
which exercise the highmem zone as well and check that everything
behaves as expected.

Of course, this all assumes that you have a 4GB i386 box :( It's almost
getting to the stage where we need a fake-zone-highmem option for
x86_64 boxes just so we can test this stuff.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17  4:27 ` [PATCH 01/13] writeback: IO-less balance_dirty_pages() Wu Fengguang
@ 2010-11-17 10:34   ` Minchan Kim
  2010-11-22  2:01     ` Wu Fengguang
  2010-11-17 23:08   ` Andrew Morton
  2010-11-18 13:04   ` Peter Zijlstra
  2 siblings, 1 reply; 18+ messages in thread
From: Minchan Kim @ 2010-11-17 10:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner,
	Peter Zijlstra, Jens Axboe, Christoph Hellwig, Theodore Ts'o,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel, LKML

Hi Wu,

As you know, I am not an expert in this area.
So I hope my review can help other newbies like me understand this
document, and make it clearer. :)
I haven't looked into the code yet. Before that, I would like to make
sure I understand your concept.

On Wed, Nov 17, 2010 at 1:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> As proposed by Chris, Dave and Jan, don't start foreground writeback IO
> inside balance_dirty_pages(). Instead, simply let it idle sleep for some
> time to throttle the dirtying task. In the mean while, kick off the
> per-bdi flusher thread to do background writeback IO.
>
> This patch introduces the basic framework, which will be further
> consolidated by the next patches.
>
> RATIONALS
> =========
>
> The current balance_dirty_pages() is rather IO inefficient.
>
> - concurrent writeback of multiple inodes (Dave Chinner)
>
>  If every thread doing writes and being throttled start foreground
>  writeback, it leads to N IO submitters from at least N different
>  inodes at the same time, end up with N different sets of IO being
>  issued with potentially zero locality to each other, resulting in
>  much lower elevator sort/merge efficiency and hence we seek the disk
>  all over the place to service the different sets of IO.
>  OTOH, if there is only one submission thread, it doesn't jump between
>  inodes in the same way when congestion clears - it keeps writing to
>  the same inode, resulting in large related chunks of sequential IOs
>  being issued to the disk. This is more efficient than the above
>  foreground writeback because the elevator works better and the disk
>  seeks less.
>
> - IO size too small for fast arrays and too large for slow USB sticks
>
>  The write_chunk used by current balance_dirty_pages() cannot be
>  directly set to some large value (eg. 128MB) for better IO efficiency.
>  Because it could lead to more than 1 second user perceivable stalls.
>  Even the current 4MB write size may be too large for slow USB sticks.
>  The fact that balance_dirty_pages() starts IO on itself couples the
>  IO size to wait time, which makes it hard to do suitable IO size while
>  keeping the wait time under control.
>
> For the above two reasons, it's much better to shift IO to the flusher
> threads and let balance_dirty_pages() just wait for enough time or progress.
>
> Jan Kara, Dave Chinner and me explored the scheme to let
> balance_dirty_pages() wait for enough writeback IO completions to
> safeguard the dirty limit. However it's found to have two problems:
>
> - in large NUMA systems, the per-cpu counters may have big accounting
>  errors, leading to big throttle wait time and jitters.
>
> - NFS may kill large amount of unstable pages with one single COMMIT.
>  Because NFS server serves COMMIT with expensive fsync() IOs, it is
>  desirable to delay and reduce the number of COMMITs. So it's not
>  likely to optimize away such kind of bursty IO completions, and the
>  resulted large (and tiny) stall times in IO completion based throttling.
>
> So here is a pause time oriented approach, which tries to control the
> pause time in each balance_dirty_pages() invocations, by controlling
> the number of pages dirtied before calling balance_dirty_pages(), for
> smooth and efficient dirty throttling:
>
> - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> - avoid too small pause time (less than  10ms, which burns CPU power)
> - avoid too large pause time (more than 100ms, which hurts responsiveness)
> - avoid big fluctuations of pause times
>
> For example, when doing a simple cp on ext4 with mem=4G HZ=250.
>
> before patch, the pause time fluctuates from 0 to 324ms
> (and the stall time may grow very large for slow devices)
>
> [ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
> [ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> [ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> [ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> [ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> [ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> [ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> [ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> [ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> [ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
> [ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
> [ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
> [ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> [ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> [ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
>
> after patch, the pause time remains stable around 32ms
>
> cp-2687  [002]  1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
> cp-2687  [002]  1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
> cp-2687  [006]  1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
> cp-2687  [006]  1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
> cp-2687  [006]  1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
> cp-2687  [002]  1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
> cp-2687  [002]  1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8
>
> CONTROL SYSTEM
> ==============
>
> The current task_dirty_limit() adjusts bdi_dirty_limit to get
> task_dirty_limit according to the dirty "weight" of the current task,
> which is the percent of pages recently dirtied by the task. If 100%
> pages are recently dirtied by the task, it will lower bdi_dirty_limit by
> 1/8. If only 1% pages are dirtied by the task, it will return almost
> unmodified bdi_dirty_limit. In this way, a heavy dirtier will get
> blocked at task_dirty_limit=(bdi_dirty_limit-bdi_dirty_limit/8) while
> allowing a light dirtier to progress (the latter won't be blocked
> because R << B in fig.1).
>
> Fig.1 before patch, a heavy dirtier and a light dirtier
>                                                R
> ----------------------------------------------+-o---------------------------*--|
>                                              L A                           B  T
>  T: bdi_dirty_limit, as returned by bdi_dirty_limit()
>  L: T - T/8
>
>  R: bdi_reclaimable + bdi_writeback
>
>  A: task_dirty_limit for a heavy dirtier ~= R ~= L
>  B: task_dirty_limit for a light dirtier ~= T
>
> Since each process has its own dirty limit, we reuse A/B for the tasks as
> well as their dirty limits.
>
> If B is a newly started heavy dirtier, then it will slowly gain weight
> and A will lose weight.  The task_dirty_limit for A and B will be
> approaching the center of region (L, T) and eventually stabilize there.
>
> Fig.2 before patch, two heavy dirtiers converging to the same threshold
>                                                             R
> ----------------------------------------------+--------------o-*---------------|
>                                              L              A B               T

Seems good so far.
So what's the problem if two heavy dirtiers have the same threshold?

>
> Fig.3 after patch, one heavy dirtier
>                                                |
>    throttle_bandwidth ~= bdi_bandwidth  =>     o
>                                                | o
>                                                |   o
>                                                |     o
>                                                |       o
>                                                |         o
>                                              La|           o
> ----------------------------------------------+-+-------------o----------------|
>                                                R             A                T
>  T: bdi_dirty_limit
>  A: task_dirty_limit      = T - Wa * T/16
>  La: task_throttle_thresh = A - A/16
>
>  R: bdi_dirty_pages = bdi_reclaimable + bdi_writeback ~= La
>
> Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
> way. In fig.3, a soft dirty limit region (La, A) is introduced. When R enters
> this region, the task may be throttled for J jiffies on every N pages it dirtied.
> Let's call (N/J) the "throttle bandwidth". It is computed by the following formula:
>
>        throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
> where
>        A = T - Wa * T/16
>        La = A - A/16
> where Wa is task weight for A. It's 0 for very light dirtier and 1 for
> the one heavy dirtier (that consumes 100% bdi write bandwidth).  The
> task weight will be updated independently by task_dirty_inc() at
> set_page_dirty() time.


Dumb question.

I can't see the difference between the old and the new.
La depends on A.
A depends on Wa.
T is constant?
Then throttle_bandwidth depends on Wa.
Wa depends on the number of pages dirtied during some interval.
So if a light dirtier becomes heavy, the light dirtier and the heavy
dirtier will eventually have the same weight.
That means throttle_bandwidth is the same, which matches the old result.

Please, open my eyes. :)
Thanks for the great work.

>
> When R < La, we don't throttle it at all.
> When R > A, the code will detect the negativeness and choose to pause
> 100ms (the upper pause boundary), then loop over again.




-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 01/13] writeback: IO-less balance_dirty_pages()
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-17 10:34   ` Minchan Kim
                     ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Chris Mason, Dave Chinner, Peter Zijlstra, Jens Axboe,
	Wu Fengguang, Christoph Hellwig, Theodore Ts'o, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

[-- Attachment #1: writeback-bw-throttle.patch --]
[-- Type: text/plain, Size: 28688 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the meantime, kick off the
per-bdi flusher thread to do background writeback IO.

This patch introduces the basic framework, which will be further
consolidated by the next patches.

RATIONALE
=========

The current balance_dirty_pages() is rather IO inefficient.

- concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled starts foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, ending up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency,
  because it could lead to user perceivable stalls of more than 1 second.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO by itself couples the
  IO size to wait time, which makes it hard to choose a suitable IO size
  while keeping the wait time under control.

For the above two reasons, it's much better to shift IO to the flusher
threads and let balance_dirty_pages() just wait for enough time or progress.

Jan Kara, Dave Chinner and I explored a scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However, it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may clear a large number of unstable pages with one single COMMIT.
  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So such bursty IO
  completions are unlikely to be optimized away, and neither are the
  resulting large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocation, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than  10ms, which burns CPU power)
- avoid too large pause time (more than 100ms, which hurts responsiveness)
- avoid big fluctuations of pause times

For example, when doing a simple cp on ext4 with mem=4G HZ=250 (the
pause= values in the traces below are in jiffies, 4ms each at HZ=250):

before patch, the pause time fluctuates from 0 to 324ms
(and the stall time may grow very large for slow devices)

[ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
[ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
[ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0

after patch, the pause time remains stable around 32ms

cp-2687  [002]  1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [002]  1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [006]  1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687  [006]  1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [006]  1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [002]  1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687  [002]  1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8
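
The millisecond figures quoted above can be cross-checked against the
logged pause= values with a small userspace sketch. It is not part of
the patch; the helper name pause_jiffies_to_ms() is made up for
illustration, and it assumes pause= is logged in jiffies.

```c
#include <assert.h>

/*
 * Hypothetical helper (illustration only, not a kernel API):
 * convert a pause length in jiffies to milliseconds for a given HZ.
 * Integer division truncates, which is exact when HZ divides 1000.
 */
static unsigned int pause_jiffies_to_ms(unsigned int nr_jiffies, unsigned int hz)
{
	assert(hz > 0);
	return nr_jiffies * 1000u / hz;
}
```

With HZ=250, the largest pre-patch value (pause=81) maps to 324ms and
the post-patch value (pause=8) maps to 32ms, matching the numbers in the
text.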

CONTROL SYSTEM
==============

The current task_dirty_limit() adjusts bdi_dirty_limit to get
task_dirty_limit according to the dirty "weight" of the current task,
which is the percentage of pages recently dirtied by the task. If 100%
of the pages were recently dirtied by the task, it will lower
bdi_dirty_limit by 1/8. If only 1% of the pages were dirtied by the
task, it will return an almost unmodified bdi_dirty_limit. In this way,
a heavy dirtier will get
blocked at task_dirty_limit=(bdi_dirty_limit-bdi_dirty_limit/8) while
allowing a light dirtier to progress (the latter won't be blocked
because R << B in fig.1).

Fig.1 before patch, a heavy dirtier and a light dirtier
                                                R
----------------------------------------------+-o---------------------------*--|
                                              L A                           B  T
  T: bdi_dirty_limit, as returned by bdi_dirty_limit()
  L: T - T/8

  R: bdi_reclaimable + bdi_writeback

  A: task_dirty_limit for a heavy dirtier ~= R ~= L
  B: task_dirty_limit for a light dirtier ~= T

Since each process has its own dirty limit, we reuse A/B for the tasks as
well as their dirty limits.

If B is a newly started heavy dirtier, then it will slowly gain weight
and A will lose weight.  The task_dirty_limit for A and B will be
approaching the center of region (L, T) and eventually stabilize there.

Fig.2 before patch, two heavy dirtiers converging to the same threshold
                                                             R
----------------------------------------------+--------------o-*---------------|
                                              L              A B               T

Fig.3 after patch, one heavy dirtier
                                                |
    throttle_bandwidth ~= bdi_bandwidth  =>     o
                                                | o
                                                |   o
                                                |     o
                                                |       o
                                                |         o
                                              La|           o
----------------------------------------------+-+-------------o----------------|
                                                R             A                T
  T: bdi_dirty_limit
  A: task_dirty_limit      = T - Wa * T/16
  La: task_throttle_thresh = A - A/16

  R: bdi_dirty_pages = bdi_reclaimable + bdi_writeback ~= La

Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
way. In fig.3, a soft dirty limit region (La, A) is introduced. When R enters
this region, the task may be throttled for J jiffies on every N pages it dirtied.
Let's call (N/J) the "throttle bandwidth". It is computed by the following formula:

        throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
where
        A  = T - Wa * T/16
        La = A - A/16
and Wa is the task weight for A. It's 0 for a very light dirtier and 1 for
the one heavy dirtier (that consumes 100% bdi write bandwidth).  The
task weight will be updated independently by task_dirty_inc() at
set_page_dirty() time.

When R < La, we don't throttle it at all.
When R > A, the code will detect the negativeness and choose to pause
100ms (the upper pause boundary), then loop over again.
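
The formula above can be sketched in user-space C as follows. This is
only an illustration under stated assumptions, not the kernel code:
SKETCH_HZ, the function names and the num/den representation of the task
weight Wa are made up for the example, and bdi_bw is taken to be in
pages per second. R >= A is treated as the hard case, as the patch does.

```c
#include <assert.h>

#define SKETCH_HZ	1000UL			/* assumed tick rate */
#define MAX_PAUSE	(SKETCH_HZ / 10)	/* 100ms upper pause boundary */

/* A = T - Wa * T/16, with the task weight Wa given as num/den in [0, 1] */
static unsigned long task_dirty_limit(unsigned long T,
				      unsigned long num, unsigned long den)
{
	assert(den != 0 && num <= den);
	return T - (unsigned long long)(T / 16) * num / den;
}

/* La = A - A/16 */
static unsigned long task_throttle_thresh(unsigned long A)
{
	return A - A / 16;
}

/* bdi_bandwidth * (A - R) / (A - La); only meaningful for La <= R < A */
static unsigned long throttle_bandwidth(unsigned long bdi_bw,
					unsigned long A, unsigned long R)
{
	unsigned long La = task_throttle_thresh(A);

	assert(R >= La && R < A);
	return (unsigned long long)bdi_bw * (A - R) / (A - La);
}

/* Pause in jiffies after dirtying "pages" pages; 0 means no pause. */
static unsigned long throttle_pause(unsigned long bdi_bw, unsigned long A,
				    unsigned long R, unsigned long pages)
{
	unsigned long long pause;

	if (R < task_throttle_thresh(A))
		return 0;			/* below La: not throttled */
	if (R >= A)
		return MAX_PAUSE;		/* at or above A: pause 100ms */

	/* +1 avoids a division by zero when the bandwidth rounds to 0 */
	pause = (unsigned long long)SKETCH_HZ * pages /
		(throttle_bandwidth(bdi_bw, A, R) + 1ULL);
	if (pause < 1)
		pause = 1;
	if (pause > MAX_PAUSE)
		pause = MAX_PAUSE;
	return pause;
}
```

With T = 1600 pages and Wa = 1, this gives A = 1500 and La = 1407. At
R = La, a task that dirtied 256 pages against a 12800 pages/s device
pauses for 19 jiffies in this sketch; close to A the computed pause is
clamped to the 100ms boundary.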


PSEUDO CODE
===========

balance_dirty_pages():

	/* soft throttling */
	if (task_throttle_thresh exceeded)
		sleep (task_dirtied_pages / throttle_bandwidth)

	/* hard throttling */
	while (task_dirty_limit exceeded) {
		sleep 100ms
		if (bdi_dirty_pages dropped more than task_dirtied_pages)
			break
	}

	/* global hard limit */
	while (dirty_limit exceeded)
		sleep 100ms

Basically there are three levels of throttling now.

- normally the dirtier will be adaptively throttled with good timing

- when task_dirty_limit is exceeded, the task will be throttled until
  the bdi dirty/writeback pages drop by a reasonably large amount

- when dirty_thresh is exceeded, the task can be throttled for an
  arbitrarily long time
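
The three levels can be sketched as a small classifier. This is a
user-space illustration only: struct dirty_state, classify() and
hard_throttle_done() are hypothetical names, and the real function
sleeps in a loop rather than returning a level.

```c
#include <assert.h>

enum throttle_level {
	THROTTLE_NONE,		/* below task_throttle_thresh */
	THROTTLE_SOFT,		/* sleep task_dirtied_pages / throttle_bandwidth */
	THROTTLE_TASK_HARD,	/* sleep 100ms until bdi pages drop enough */
	THROTTLE_GLOBAL_HARD,	/* sleep 100ms while over the global limit */
};

/* Hypothetical snapshot of the counters the pseudo code looks at. */
struct dirty_state {
	unsigned long nr_dirty;		/* global dirty + writeback pages */
	unsigned long dirty_thresh;	/* global hard limit */
	unsigned long bdi_dirty;	/* bdi_reclaimable + bdi_writeback */
	unsigned long task_limit;	/* task_dirty_limit */
	unsigned long task_thresh;	/* task_throttle_thresh */
};

/* Report the most severe level that currently applies. */
static enum throttle_level classify(const struct dirty_state *s)
{
	if (s->nr_dirty > s->dirty_thresh)
		return THROTTLE_GLOBAL_HARD;
	if (s->bdi_dirty > s->task_limit)
		return THROTTLE_TASK_HARD;
	if (s->bdi_dirty > s->task_thresh)
		return THROTTLE_SOFT;
	return THROTTLE_NONE;
}

/*
 * Exit test of the hard throttling loop: bdi_dirty_pages dropped by
 * more than the pages this task dirtied since its last throttle.
 */
static int hard_throttle_done(unsigned long bdi_dirty_before,
			      unsigned long bdi_dirty_now,
			      unsigned long task_dirtied_pages)
{
	return bdi_dirty_before > bdi_dirty_now &&
	       bdi_dirty_before - bdi_dirty_now > task_dirtied_pages;
}
```

Checking the global limit first is a choice made for the sketch, so that
the most severe level wins; the pseudo code above walks the levels from
soft to hard.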


BEHAVIOR CHANGE
===============

Users will notice that applications get throttled once dirty pages cross
the global (background + dirty)/2=15% threshold. For a single "cp", it
could be soft throttled at 8*bdi->write_bandwidth around 15% dirty pages,
and be balanced at speed bdi->write_bandwidth around 17.5% dirty pages.
Before the patch, the behavior is to just throttle it at 17.5% dirty
pages.

Since the task will be soft throttled earlier than before, end users may
perceive it as a performance "slow down" if their application happens to
dirty more than ~15% of memory.


BENCHMARKS
==========

The test box has a 4-core 3.2GHz CPU, 4GB mem and a SATA disk.

For each filesystem, the following command is run 3 times.

time (dd if=/dev/zero of=/tmp/10G bs=1M count=10240; sync); rm /tmp/10G

	    2.6.36-rc2-mm1	2.6.36-rc2-mm1+balance_dirty_pages
average real time
ext2        236.377s            232.144s              -1.8%
ext3        226.245s            225.751s              -0.2%
ext4        178.742s            179.343s              +0.3%
xfs         183.562s            179.808s              -2.0%
btrfs       179.044s            179.461s              +0.2%
NFS         645.627s            628.937s              -2.6%

average system time
ext2         22.142s             19.656s             -11.2%
ext3         34.175s             32.462s              -5.0%
ext4         23.440s             21.162s              -9.7%
xfs          19.089s             16.069s             -15.8%
btrfs        12.212s             11.670s              -4.4%
NFS          16.807s             17.410s              +3.6%

total user time
sum           0.136s              0.084s             -38.2%

In a more recent run of the tests, the throughput is in fact slightly lower:

ext2         49.500 MB/s         49.200 MB/s          -0.6%
ext3         50.133 MB/s         50.000 MB/s          -0.3%
ext4         64.000 MB/s         63.200 MB/s          -1.2%
xfs          63.500 MB/s         63.167 MB/s          -0.5%
btrfs        63.133 MB/s         63.033 MB/s          -0.2%
NFS          16.833 MB/s         16.867 MB/s          +0.2%

In general there are no big IO performance changes for desktop users,
except for some noticeable reduction of CPU overheads. The patch mainly
benefits file servers with heavy concurrent writers on fast storage
arrays, as can be demonstrated by 10/100 concurrent dd's on xfs:

- 1 dirtier case:    the same
- 10 dirtiers case:  CPU system time is reduced to 50%
- 100 dirtiers case: CPU system time is reduced to 10%, IO size and throughput increase by 10%

			2.6.37-rc2				2.6.37-rc1-next-20101115+
        ----------------------------------------        ----------------------------------------
	%system		wkB/s		avgrq-sz	%system		wkB/s		avgrq-sz
100dd	30.916		37843.000	748.670		3.079		41654.853	822.322
100dd	30.501		37227.521	735.754		3.744		41531.725	820.360

10dd	39.442		47745.021	900.935		20.756		47951.702	901.006
10dd	39.204		47484.616	899.330		20.550		47970.093	900.247

1dd	13.046		57357.468	910.659		13.060		57632.715	909.212
1dd	12.896		56433.152	909.861		12.467		56294.440	909.644
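
The summary figures above can be checked against the first row of each
pair in the table. A small sketch, where struct dd_row and ratio() are
names made up for the example and the numbers are copied from the table:

```c
#include <assert.h>

/* One table row: old kernel columns first, then the patched kernel. */
struct dd_row {
	int ndd;				/* number of concurrent dd's */
	double sys_old, wkb_old, rq_old;	/* %system, wkB/s, avgrq-sz */
	double sys_new, wkb_new, rq_new;
};

static const struct dd_row rows[] = {
	{ 100, 30.916, 37843.000, 748.670,  3.079, 41654.853, 822.322 },
	{  10, 39.442, 47745.021, 900.935, 20.756, 47951.702, 901.006 },
	{   1, 13.046, 57357.468, 910.659, 13.060, 57632.715, 909.212 },
};

/* new / old: 0.10 reads as "reduced to 10%", 1.10 as "increased by 10%" */
static double ratio(double old_val, double new_val)
{
	assert(old_val > 0.0);
	return new_val / old_val;
}
```

For the 100dd row, ratio() gives about 0.10 for %system and about 1.10
for both wkB/s and avgrq-sz; for the 10dd row, %system comes out at
about 0.53.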

The CPU overheads in 2.6.37-rc1-next-20101115+ are higher than in
2.6.36-rc2-mm1+balance_dirty_pages. This may be due to the pause time
stabilizing at lower values after some algorithm adjustments (e.g.
reducing the minimal pause time from 10ms to 1 jiffy in the new version),
leading to many more balance_dirty_pages() calls. The different pause
time also explains the different system time for the 1/10/100dd cases on
the same 2.6.37-rc1-next-20101115+.

CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/filesystems/writeback-throttling-design.txt |  210 ++++++++++
 include/linux/writeback.h                                 |   10 
 mm/page-writeback.c                                       |   85 +---
 3 files changed, 249 insertions(+), 56 deletions(-)

--- linux-next.orig/include/linux/writeback.h	2010-11-15 19:49:41.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-11-15 19:49:42.000000000 +0800
@@ -12,6 +12,16 @@ struct backing_dev_info;
 extern spinlock_t inode_lock;
 
 /*
+ * The 1/8 region under the bdi dirty threshold is set aside for elastic
+ * throttling. In rare cases when the threshold is exceeded, more rigid
+ * throttling will be imposed, which will inevitably stall the dirtier task
+ * for seconds (or more) at _one_ time. The rare case could be a fork bomb
+ * where every new task dirties some more pages.
+ */
+#define BDI_SOFT_DIRTY_LIMIT	8
+#define TASK_SOFT_DIRTY_LIMIT	(BDI_SOFT_DIRTY_LIMIT * 2)
+
+/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {
--- linux-next.orig/mm/page-writeback.c	2010-11-15 19:49:41.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-15 19:50:16.000000000 +0800
@@ -42,20 +42,6 @@
  */
 static long ratelimit_pages = 32;
 
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
-
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
@@ -279,7 +265,7 @@ static unsigned long task_dirty_limit(st
 {
 	long numerator, denominator;
 	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty >> 3;
+	u64 inv = dirty / TASK_SOFT_DIRTY_LIMIT;
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;
@@ -473,26 +459,25 @@ unsigned long bdi_dirty_limit(struct bac
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	unsigned long bw;
+	unsigned long pause;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -529,6 +514,23 @@ static void balance_dirty_pages(struct a
 			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		if (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) {
+			pause = HZ/10;
+			goto pause;
+		}
+
+		bw = 100 << 20; /* use static 100MB/s for the moment */
+
+		bw = bw * (bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback));
+		bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
+
+		pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
+		pause = clamp_val(pause, 1, HZ/10);
+
+pause:
+		__set_current_state(TASK_INTERRUPTIBLE);
+		io_schedule_timeout(pause);
+
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the
 		 * global "hard" limit. The former helps to prevent heavy IO
@@ -544,35 +546,6 @@ static void balance_dirty_pages(struct a
 
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
-
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
-		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
-		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
-		__set_current_state(TASK_INTERRUPTIBLE);
-		io_schedule_timeout(pause);
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
 	if (!dirty_exceeded && bdi->dirty_exceeded)
@@ -589,7 +562,7 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
+	if ((laptop_mode && dirty_exceeded) ||
 	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_background_writeback(bdi);
 }
@@ -638,7 +611,7 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
+		ratelimit = *p;
 		*p = 0;
 		preempt_enable();
 		balance_dirty_pages(mapping, ratelimit);
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-next/Documentation/filesystems/writeback-throttling-design.txt	2010-11-15 19:49:42.000000000 +0800
@@ -0,0 +1,210 @@
+writeback throttling design
+---------------------------
+
+introduction to dirty throttling
+--------------------------------
+
+write(2) is normally a buffered write that creates dirty page cache pages to
+hold the data and returns immediately. The dirty pages will eventually be
+written to disk, or be dropped by unlink()/truncate().
+
+The delayed writeback of dirty pages enables the kernel to optimize the IO:
+
+- turn IO into async ones, which avoids blocking the tasks
+- submit IO as a batch for better throughput
+- avoid IO at all for temp files
+
+However, there have to be some limits on the number of allowable dirty pages.
+Typically applications are able to dirty pages more quickly than storage
+devices can write them. When approaching the dirty limits, the dirtier tasks
+will be throttled (put to brief sleeps from time to time) by
+balance_dirty_pages() in order to balance the dirty speed and writeback speed.
+
+dirty limits
+------------
+
+The dirty limit defaults to 20% reclaimable memory, and can be tuned via one of
+the following sysctl interfaces:
+
+	/proc/sys/vm/dirty_ratio
+	/proc/sys/vm/dirty_bytes
+
+The ultimate goal of balance_dirty_pages() is to keep the global dirty pages
+under control.
+
+	dirty_limit = dirty_ratio * free_reclaimable_pages
+
+However a global threshold may create deadlock for stacked BDIs (loop, FUSE and
+local NFS mounts). When A writes to B, and A generates enough dirty pages to
+get throttled, B will never start writeback until the dirty pages go away.
+
+Another problem is inter device starvation. When there are concurrent writes to
+a slow device and a fast one, the latter may well be starved due to unnecessary
+throttling on its dirtier tasks, leading to big IO performance drop.
+
+The solution is to split the global dirty limit into per-bdi limits among all
+the backing devices and scale writeback cache per backing device, proportional
+to its writeout speed.
+
+	bdi_dirty_limit = bdi_weight * dirty_limit
+
+where bdi_weight (ranging from 0 to 1) reflects the recent writeout speed of
+the BDI.
+
+We further scale the bdi dirty limit inversely with the task's dirty rate.
+This makes heavy writers have a lower dirty limit than the occasional writer,
+to prevent a heavy dd from slowing down all other light writers in the system.
+
+	task_dirty_limit = bdi_dirty_limit - task_weight * bdi_dirty_limit/16
+
+pause time
+----------
+
+The main task of dirty throttling is to determine when and how long to pause
+the current dirtier task.  Basically we want to
+
+- avoid too small pause time (less than 1 jiffy, which burns CPU power)
+- avoid too large pause time (more than 100ms, which hurts responsiveness)
+- avoid big fluctuations of pause times
+
+To smoothly control the pause time, we do soft throttling in a small region
+under task_dirty_limit, starting from
+
+	task_throttle_thresh = task_dirty_limit - task_dirty_limit/16
+
+In fig.1, when bdi_dirty_pages falls into
+
+    [0, La]:    do nothing
+    [La, A]:    do soft throttling
+    [A, inf]:   do hard throttling
+
+Where hard throttling is to wait until bdi_dirty_pages falls more than
+task_dirtied_pages (the pages dirtied by the task since its last throttle
+time). It's "hard" because it may end up waiting for a long time.
+
+Fig.1 dirty throttling regions
+                                              o
+                                                o
+                                                  o
+                                                    o
+                                                      o
+                                                        o
+                                                          o
+                                                            o
+----------------------------------------------+---------------o----------------|
+                                              La              A                T
+                no throttle                     soft throttle   hard throttle
+  T: bdi_dirty_limit
+  A: task_dirty_limit      = T - task_weight * T/16
+  La: task_throttle_thresh = A - A/16
+
+Soft dirty throttling is to pause the dirtier task for J:pause_time jiffies on
+every N:task_dirtied_pages pages it dirtied.  Let's call (N/J) the "throttle
+bandwidth". It is computed by the following formula:
+
+                                     task_dirty_limit - bdi_dirty_pages
+throttle_bandwidth = bdi_bandwidth * ----------------------------------
+                                           task_dirty_limit/16
+
+where bdi_bandwidth is the BDI's estimated write speed.
+
+Given the throttle_bandwidth for a task, we select a suitable N, so that when
+the task has dirtied that many pages, it enters balance_dirty_pages() to sleep
+for roughly J jiffies. N adapts to the storage and task write speeds, so that
+the task always gets a suitable (neither too long nor too short) pause time.
+
+dynamics
+--------
+
+When there is one heavy dirtier, bdi_dirty_pages will keep growing until
+exceeding the low threshold of the task's soft throttling region [La, A],
+at which point (La) the task will be throttled to the speed
+throttle_bandwidth=bdi_bandwidth (fig.2) and remain stable there.
+
+Fig.2 one heavy dirtier
+
+    throttle_bandwidth ~= bdi_bandwidth  =>   o
+                                              | o
+                                              |   o
+                                              |     o
+                                              |       o
+                                              |         o
+                                              |           o
+                                            La|             o
+----------------------------------------------+---------------o----------------|
+                                              R               A                T
+  R: bdi_dirty_pages ~= La
+
+When there comes a new dd task B, task_weight_B will gradually grow from 0 to
+50% while task_weight_A will decrease from 100% to 50%.  When task_weight_B is
+still small, B is considered a light dirtier and is allowed to dirty pages much
+faster than the bdi write bandwidth. In fact initially it won't be throttled at
+all when R < Lb where Lb = B - B/16 and B ~= T.
+
+Fig.3 an old dd (A) + a newly started dd (B)
+
+                      throttle bandwidth  =>    *
+                                                | *
+                                                |   *
+                                                |     *
+                                                |       *
+                                                |         *
+                                                |           *
+                                                |             *
+                      throttle bandwidth  =>    o               *
+                                                | o               *
+                                                |   o               *
+                                                |     o               *
+                                                |       o               *
+                                                |         o               *
+                                                |           o               *
+------------------------------------------------+-------------o---------------*|
+                                                R             A               BT
+
+So R:bdi_dirty_pages will grow large. As task_weight_A and task_weight_B
+converge to 50%, the points A, B will go towards each other (fig.4) and
+eventually coincide with each other. R will stabilize around A-A/32 where
+A=B=T-0.5*T/16.  throttle_bandwidth will stabilize around bdi_bandwidth/2.
+
+Note that the application "think+dirty time" is ignored for simplicity in the
+above discussions. With non-zero user space think time, the balance point will
+drift slightly, which is not a big deal.
+
+Fig.4 the two dd's converging to the same bandwidth
+
+                                                         |
+                                 throttle bandwidth  =>  *
+                                                         | *
+                                 throttle bandwidth  =>  o   *
+                                                         | o   *
+                                                         |   o   *
+                                                         |     o   *
+                                                         |       o   *
+                                                         |         o   *
+---------------------------------------------------------+-----------o---*-----|
+                                                         R           A   B     T
+
+There won't be big oscillations between A and B, because as soon as A coincides
+with B, their throttle_bandwidth and hence dirty speed will be equal, A's
+weight will stop decreasing and B's weight will stop growing, so the two points
+won't keep moving and cross each other.
+
+Sure there are always oscillations of bdi_dirty_pages as long as the dirtier
+tasks alternately dirty and pause. But they will be bounded. When there is 1
+heavy dirtier, the error bound will be (pause_time * bdi_bandwidth). When there
+are 2 heavy dirtiers, the max error is 2 * (pause_time * bdi_bandwidth/2),
+which remains the same as 1 dirtier case (given the same pause time). In fact
+the more dirtier tasks there are, the smaller the error will be, since the
+dirtier tasks are not likely to go to sleep at the same time.
+
+References
+----------
+
+Smarter write throttling
+http://lwn.net/Articles/245600/
+
+Flushing out pdflush
+http://lwn.net/Articles/326552/
+
+Dirty throttling slides
+http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling.pdf


