* [PATCH 00/18] IO-less dirty throttling v11
@ 2011-09-04  1:53 ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML, Wu Fengguang

Hi,

Finally, here is the complete IO-less balance_dirty_pages(). NFS is observed
to perform better or worse depending on memory size; otherwise, the added
patches address all known regressions.

        git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v11
	(to be updated; currently it contains a pre-release v11)

Changes since v10:

- complete the renames
- add protections for IO queue underrun
  - pause time reduction
  - bdi reserve area
  - bdi underrun flag
- more accurate task dirty accounting for
  - sub-page writes
  - FS re-dirties
  - short lived tasks

Changes since v9:

- a lot of renames and comment/changelog rework, again
- separate out the dirty_ratelimit update policy (as patch 04)
- add think time compensation
- add 3 trace events

Changes since v8:

- a lot of renames and comment/changelog rework
- use 3rd order polynomial as the global control line (Peter)
- stabilize dirty_ratelimit by decreasing update step size on small errors
- limit per-CPU dirtied pages to keep dirty pages from running away with 1k+ tasks (Peter)

Thanks a lot to Peter, Vivek, Andrea and Jan for the careful reviews!

shortlog:

	Wu Fengguang (18):
	      writeback: account per-bdi accumulated dirtied pages
	      writeback: dirty position control
	      writeback: dirty rate control
	      writeback: stabilize bdi->dirty_ratelimit
	      writeback: per task dirty rate limit
	      writeback: IO-less balance_dirty_pages()
	      writeback: dirty ratelimit - think time compensation
	      writeback: trace dirty_ratelimit
	      writeback: trace balance_dirty_pages
	      writeback: dirty position control - bdi reserve area
	      block: add bdi flag to indicate risk of io queue underrun
	      writeback: balanced_rate cannot exceed write bandwidth
	      writeback: limit max dirty pause time
	      writeback: control dirty pause time
	      writeback: charge leaked page dirties to active tasks
	      writeback: fix dirtied pages accounting on sub-page writes
	      writeback: fix dirtied pages accounting on redirty
	      btrfs: fix dirtied pages accounting on sub-page writes

diffstat:

	 block/blk-core.c                 |    7 
	 fs/btrfs/file.c                  |    3 
	 fs/fs-writeback.c                |    2 
	 include/linux/backing-dev.h      |   26 
	 include/linux/blkdev.h           |   12 
	 include/linux/sched.h            |    8 
	 include/linux/writeback.h        |    5 
	 include/trace/events/writeback.h |  151 ++++-
	 kernel/exit.c                    |    2 
	 kernel/fork.c                    |    4 
	 mm/backing-dev.c                 |    3 
	 mm/page-writeback.c              |  768 +++++++++++++++++++++++------
	 12 files changed, 816 insertions(+), 175 deletions(-)

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 01/18] writeback: account per-bdi accumulated dirtied pages
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Jan Kara, Michael Rubin, Wu Fengguang,
	Andrew Morton, Christoph Hellwig, Dave Chinner, Greg Thelen,
	Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-bdi-dirtied.patch --]
[-- Type: text/plain, Size: 2019 bytes --]

Introduce the BDI_DIRTIED counter. It will be used for estimating the
bdi's dirty bandwidth.
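
For illustration only (not part of the patch; the helper name and units below
are made up): a cumulative dirtied-pages counter like BDI_DIRTIED allows the
dirty bandwidth to be estimated by sampling the counter twice and dividing the
delta by the elapsed time, roughly as in this userspace sketch:

#include <stdio.h>

/* toy sketch: pages dirtied per second from two counter samples */
static unsigned long dirty_bw_estimate(unsigned long dirtied_then,
				       unsigned long dirtied_now,
				       unsigned long elapsed_ms)
{
	return (dirtied_now - dirtied_then) * 1000 / elapsed_ms;
}

int main(void)
{
	/* 25600 pages dirtied in 200ms => 128000 pages/s (~500MB/s at 4KB) */
	printf("%lu pages/s\n", dirty_bw_estimate(100000, 125600, 200));
	return 0;
}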

CC: Jan Kara <jack@suse.cz>
CC: Michael Rubin <mrubin@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    2 ++
 mm/page-writeback.c         |    1 +
 3 files changed, 4 insertions(+)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-16 09:30:23.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-17 10:15:45.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_DIRTIED,
 	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
--- linux-next.orig/mm/page-writeback.c	2011-08-16 09:30:23.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
@@ -1333,6 +1333,7 @@ void account_page_dirtied(struct page *p
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
 	}
--- linux-next.orig/mm/backing-dev.c	2011-08-16 09:30:23.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-17 10:15:45.000000000 +0800
@@ -97,6 +97,7 @@ static int bdi_debug_stats_show(struct s
 		   "BdiDirtyThresh:     %10lu kB\n"
 		   "DirtyThresh:        %10lu kB\n"
 		   "BackgroundThresh:   %10lu kB\n"
+		   "BdiDirtied:         %10lu kB\n"
 		   "BdiWritten:         %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
@@ -109,6 +110,7 @@ static int bdi_debug_stats_show(struct s
 		   K(bdi_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
+		   (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
 		   (unsigned long) K(bdi->write_bandwidth),
 		   nr_dirty,



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 02/18] writeback: dirty position control
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Jan Kara, Wu Fengguang, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 14322 bytes --]

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

The old scheme is:
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

The new scheme is:

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence the task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

where k is the negative slope.

If we target a 12.5% fluctuation range in pos_ratio while dirty pages
fluctuate within the range

	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

we get slope

	k = - 1 / (8 * write_bw)

Setting pos_ratio(x_intercept) = 0, we get the parameter used in the code:

	x_intercept = bdi_setpoint + 8 * write_bw
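
As a quick check (spelled out here for clarity, not part of the original
changelog): with f(bdi_dirty) = 1.0 + k * (bdi_dirty - bdi_setpoint) and
k = - 1 / (8 * write_bw), sweeping bdi_dirty across the whole
[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2] range changes f by
write_bw / (8 * write_bw) = 1/8 = 12.5%, and f drops to 0 exactly at
bdi_setpoint + 8 * write_bw, i.e. at x_intercept.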

The global/bdi slopes nicely complement each other when the system has
only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the writeout bandwidth

so that

- in memory-tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to bdi_setpoint ~= setpoint reasonably fast.

Unfortunately, in JBOD setups the fluctuation range of the bdi threshold
is related to memory size due to interference between disks.  In this
case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.

Given equations

        span = x_intercept - bdi_setpoint
        k = df/dx = - 1 / span

and the extremum values

        span = bdi_thresh
        dx = bdi_thresh

we get

        df = - dx / span = - 1.0

That means that when bdi_dirty deviates upward by a full bdi_thresh,
pos_ratio and hence the task ratelimit will fluctuate by -100%.

peter: use 3rd order polynomial for the global control line
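
To make the shape of that 3rd order control line concrete, here is a small
standalone userspace sketch (illustrative only, not kernel code; the fixed
point arithmetic mirrors what bdi_position_ratio() below does for the global
setpoint):

#include <stdio.h>

#define RATELIMIT_CALC_SHIFT	10

/*
 * pos_ratio(dirty) = 1.0 + ((setpoint - dirty) / (limit - setpoint))^3
 * evaluated in RATELIMIT_CALC_SHIFT fixed point (1024 == 1.0)
 */
static long long global_pos_ratio(long setpoint, long limit, long dirty)
{
	long long x, pos_ratio;

	x = (long long)(setpoint - dirty) * (1 << RATELIMIT_CALC_SHIFT) /
	    (limit - setpoint + 1);
	pos_ratio = x;
	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;

	return pos_ratio;
}

int main(void)
{
	long freerun = 1000, limit = 2000;
	long setpoint = (freerun + limit) / 2;	/* 1500 */
	long dirty;

	/* expect ~2.0 at freerun, 1.0 at setpoint, ~0 at limit */
	for (dirty = freerun; dirty <= limit; dirty += 250)
		printf("dirty=%4ld  pos_ratio=%5lld/1024\n",
		       dirty, global_pos_ratio(setpoint, limit, dirty));
	return 0;
}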

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  213 +++++++++++++++++++++++++++++++++++-
 3 files changed, 210 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-26 15:57:18.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 15:57:34.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -409,6 +411,12 @@ int bdi_set_max_ratio(struct backing_dev
 }
 EXPORT_SYMBOL(bdi_set_max_ratio);
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -493,6 +501,197 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0              bdi_setpoint                    x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long write_bw = bdi->avg_write_bandwidth;
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* dirty pages' target balance point */
+	unsigned long bdi_setpoint;
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                           setpoint - dirty 3
+	 *        f(dirty) := 1.0 + (----------------)
+	 *                           limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup dirty_ratelimit reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx      <= 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * We have computed basic pos_ratio above based on global situation. If
+	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * pos_ratio further down/up. That is done by the following mechanism.
+	 */
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *
+	 *                        x_intercept - bdi_dirty
+	 *                     := --------------------------
+	 *                        x_intercept - bdi_setpoint
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(bdi_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 * for various filesystems, where (2) can yield in a reasonable 12.5%
+	 * fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly and choose a slope that
+	 * yields 100% pos_ratio fluctuation on suddenly doubled bdi_thresh.
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:
+	 *	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
+	bdi_setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(8*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 *
+	 *        bdi_thresh                    thresh - bdi_thresh
+	 * span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh
+	 *          thresh                            thresh
+	 */
+	span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16;
+	x_intercept = bdi_setpoint + span;
+
+	span >>= 1;
+	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			bdi_setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -591,6 +790,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -627,6 +827,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -635,8 +836,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -677,7 +878,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -721,8 +923,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-26 15:57:18.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-26 15:57:20.000000000 +0800
@@ -675,7 +675,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-26 15:57:18.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-26 15:57:20.000000000 +0800
@@ -141,6 +141,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 02/18] writeback: dirty position control
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Jan Kara, Wu Fengguang, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 14625 bytes --]

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulted task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulted pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

where k is the negative slope.

If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in range

	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

we get slope

	k = - 1 / (8 * write_bw)

Let pos_ratio(x_intercept) = 0, we get the parameter used in code:

	x_intercept = bdi_setpoint + 8 * write_bw

The global/bdi slopes are nicely complementing each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the writeout bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to bdi_setpoint ~= setpoint reasonably fast.

Unfortunately in JBOD setups, the fluctuation range of bdi threshold
is related to memory size due to the interferences between disks.  In
this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.

Given equations

        span = x_intercept - bdi_setpoint
        k = df/dx = - 1 / span

and the extremum values

        span = bdi_thresh
        dx = bdi_thresh

we get

        df = - dx / span = - 1.0

That means, when bdi_dirty deviates bdi_thresh up, pos_ratio and hence
task ratelimit will fluctuate by -100%.

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  213 +++++++++++++++++++++++++++++++++++-
 3 files changed, 210 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-26 15:57:18.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 15:57:34.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -409,6 +411,12 @@ int bdi_set_max_ratio(struct backing_dev
 }
 EXPORT_SYMBOL(bdi_set_max_ratio);
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -493,6 +501,197 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0              bdi_setpoint                    x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long write_bw = bdi->avg_write_bandwidth;
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* dirty pages' target balance point */
+	unsigned long bdi_setpoint;
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                           setpoint - dirty 3
+	 *        f(dirty) := 1.0 + (----------------)
+	 *                           limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup dirty_ratelimit reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx      <= 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * We have computed basic pos_ratio above based on global situation. If
+	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * pos_ratio further down/up. That is done by the following mechanism.
+	 */
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *
+	 *                        x_intercept - bdi_dirty
+	 *                     := --------------------------
+	 *                        x_intercept - bdi_setpoint
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(bdi_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 * for various filesystems, where (2) can yield in a reasonable 12.5%
+	 * fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly and choose a slope that
+	 * yields 100% pos_ratio fluctuation on suddenly doubled bdi_thresh.
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:
+	 *	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
+	bdi_setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(8*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 *
+	 *        bdi_thresh                    thresh - bdi_thresh
+	 * span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh
+	 *          thresh                            thresh
+	 */
+	span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16;
+	x_intercept = bdi_setpoint + span;
+
+	span >>= 1;
+	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			bdi_setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -591,6 +790,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -627,6 +827,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -635,8 +836,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -677,7 +878,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -721,8 +923,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-26 15:57:18.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-26 15:57:20.000000000 +0800
@@ -675,7 +675,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-26 15:57:18.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-26 15:57:20.000000000 +0800
@@ -141,6 +141,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 02/18] writeback: dirty position control
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Jan Kara, Wu Fengguang, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 14625 bytes --]

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulted task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulted pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

where k is the negative slope.

If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in range

	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

we get slope

	k = - 1 / (8 * write_bw)

Let pos_ratio(x_intercept) = 0, we get the parameter used in code:

	x_intercept = bdi_setpoint + 8 * write_bw

The global/bdi slopes are nicely complementing each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the writeout bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to bdi_setpoint ~= setpoint reasonably fast.

Unfortunately in JBOD setups, the fluctuation range of bdi threshold
is related to memory size due to the interferences between disks.  In
this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.

Given equations

        span = x_intercept - bdi_setpoint
        k = df/dx = - 1 / span

and the extremum values

        span = bdi_thresh
        dx = bdi_thresh

we get

        df = - dx / span = - 1.0

That means, when bdi_dirty deviates bdi_thresh up, pos_ratio and hence
task ratelimit will fluctuate by -100%.

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  213 +++++++++++++++++++++++++++++++++++-
 3 files changed, 210 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-26 15:57:18.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 15:57:34.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -409,6 +411,12 @@ int bdi_set_max_ratio(struct backing_dev
 }
 EXPORT_SYMBOL(bdi_set_max_ratio);
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -493,6 +501,197 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0              bdi_setpoint                    x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long write_bw = bdi->avg_write_bandwidth;
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* dirty pages' target balance point */
+	unsigned long bdi_setpoint;
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                           setpoint - dirty 3
+	 *        f(dirty) := 1.0 + (----------------)
+	 *                           limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that is subject to
+	 *
+	 * (1) f(freerun)  = 2.0 => ramp up dirty_ratelimit reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx      <= 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * We have computed basic pos_ratio above based on global situation. If
+	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * pos_ratio further down/up. That is done by the following mechanism.
+	 */
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *
+	 *                        x_intercept - bdi_dirty
+	 *                     := --------------------------
+	 *                        x_intercept - bdi_setpoint
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(bdi_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 * for various filesystems, where (2) yields a reasonable 12.5%
+	 * fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly and choose a slope that
+	 * yields 100% pos_ratio fluctuation on suddenly doubled bdi_thresh.
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:
+	 *	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
+	bdi_setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(8*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transition to bdi_thresh in JBOD case.
+	 *
+	 *        bdi_thresh                    thresh - bdi_thresh
+	 * span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh
+	 *          thresh                            thresh
+	 */
+	span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16;
+	x_intercept = bdi_setpoint + span;
+
+	span >>= 1;
+	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			bdi_setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -591,6 +790,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -627,6 +827,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -635,8 +836,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -677,7 +878,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -721,8 +923,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-26 15:57:18.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-26 15:57:20.000000000 +0800
@@ -675,7 +675,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-26 15:57:18.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-26 15:57:20.000000000 +0800
@@ -141,6 +141,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
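
For illustration only (not part of the patch): a minimal userspace sketch
of the global control line documented in bdi_position_ratio() above, using
made-up page counts, just to confirm properties (1)-(3): f(freerun) = 2.0,
f(setpoint) = 1.0, f(limit) = 0.

    /* standalone sketch of the 3rd order global control line */
    #include <stdio.h>

    static double pos_ratio(double freerun, double limit, double dirty)
    {
            double setpoint = (freerun + limit) / 2;
            double x = (setpoint - dirty) / (limit - setpoint);

            return 1.0 + x * x * x;
    }

    int main(void)
    {
            double freerun = 1000, limit = 3000;    /* made-up page counts */

            /* expect 2.00 1.00 0.00 */
            printf("%.2f %.2f %.2f\n",
                   pos_ratio(freerun, limit, freerun),
                   pos_ratio(freerun, limit, (freerun + limit) / 2),
                   pos_ratio(freerun, limit, limit));
            return 0;
    }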



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 03/18] writeback: dirty rate control
  2011-09-04  1:53 ` Wu Fengguang
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit --]
[-- Type: text/plain, Size: 9925 bytes --]

It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / task_ratelimit;
        sleep(pause);
    }

On every 200ms, update bdi->dirty_ratelimit
===========================================

    bdi_update_dirty_ratelimit()
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
        bdi->dirty_ratelimit = balanced_dirty_ratelimit
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth; this yields a stable amount of dirty pages:

        dirty_rate == write_bw                                          (1)

The fairness requirement gives us:

        task_ratelimit = balanced_dirty_ratelimit
                       == write_bw / N                                  (2)

where N is the number of dd tasks.  We don't know N beforehand, but
still can estimate balanced_dirty_ratelimit within 200ms.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0                               (3)
                         (any non-zero initial value is OK)

After 200ms, we measure

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

        dirty_rate == N * task_rate
                   == N * task_ratelimit_0                              (4)
Or
        task_ratelimit_0 == dirty_rate / N                              (5)

Now we conclude that the balanced task ratelimit can be estimated by

                                                      write_bw
        balanced_dirty_ratelimit = task_ratelimit_0 * ----------        (6)
                                                      dirty_rate

Because with (4) and (5) we can get the desired equality (1):

                                                       write_bw
        balanced_dirty_ratelimit == (dirty_rate / N) * ----------
                                                       dirty_rate
                                 == write_bw / N

Then using the balanced task ratelimit we can compute task pause times like:

        task_pause = task->nr_dirtied / task_ratelimit
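
For example (made-up numbers): with task_ratelimit at 1000 pages/s, a
task that has dirtied 200 pages since its last throttle point would
pause for

        task_pause = 200 / 1000 = 0.2s

before being allowed to dirty more pages.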

task_ratelimit with position control
------------------------------------

However, while the above gives us a means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit, we need another control.

The dirty position control works by extending (2) to

        task_ratelimit = balanced_dirty_ratelimit * pos_ratio           (7)

where pos_ratio is a negative feedback function that is subject to

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVILY than balanced_dirty_ratelimit, so that the dirty
pages are created more slowly than they are cleaned, thus DROP back to
the setpoint (and vice versa).

Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
remain CONSTANT for the past 200ms, we get

        task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio         (8)

Putting (8) into (6), we get the formula used in
bdi_update_dirty_ratelimit():

                                                write_bw
        balanced_dirty_ratelimit *= pos_ratio * ----------              (9)
                                                dirty_rate
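
As an illustration only (not from the patch): a small standalone sketch of
formula (9) with made-up numbers, showing that regardless of the pos_ratio
value, a single 200ms update drives the base rate to write_bw / N.

    /* sketch: update steps of formula (9) with N throttled tasks */
    #include <stdio.h>

    int main(void)
    {
            double write_bw = 100.0;        /* MB/s, made-up */
            double rate = 10.0;             /* initial dirty_ratelimit guess */
            double pos_ratio = 0.5;         /* arbitrary position feedback */
            int N = 4;                      /* number of dd tasks */
            int i;

            for (i = 0; i < 3; i++) {
                    double task_ratelimit = rate * pos_ratio;
                    double dirty_rate = N * task_ratelimit; /* measured */

                    rate = task_ratelimit * write_bw / dirty_rate;
                    printf("step %d: dirty_ratelimit = %.1f\n", i, rate);
            }
            /* settles at write_bw / N = 25 regardless of pos_ratio */
            return 0;
    }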

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
---
 include/linux/backing-dev.h |    7 ++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   82 +++++++++++++++++++++++++++++++++-
 3 files changed, 88 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-26 13:53:40.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-26 13:54:13.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base dirty throttle rate, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-08-26 13:53:40.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-26 13:54:13.000000000 +0800
@@ -670,6 +670,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-08-26 13:52:42.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 15:52:42.000000000 +0800
@@ -787,6 +787,78 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base dirty throttle rate.
+ *
+ * Normal bdi tasks will be curbed at or below it in long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long bg_thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long write_bw = bdi->avg_write_bandwidth;
+	unsigned long dirty_ratelimit = bdi->dirty_ratelimit;
+	unsigned long dirty_rate;
+	unsigned long task_ratelimit;
+	unsigned long balanced_dirty_ratelimit;
+	unsigned long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeout rate in long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * task_ratelimit reflects each dd's dirty rate for the past 200ms.
+	 */
+	task_ratelimit = (u64)dirty_ratelimit *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * A linear estimation of the "balanced" throttle rate. The theory is,
+	 * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
+	 * dirty_rate will be measured to be (N * task_ratelimit). So the below
+	 * formula will yield the balanced rate limit (write_bw / N).
+	 *
+	 * Note that the expanded form is not a pure rate feedback:
+	 *	rate_(i+1) = rate_(i) * (write_bw / dirty_rate)		     (1)
+	 * but also takes pos_ratio into account:
+	 *	rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
+	 *
+	 * (1) is not realistic because pos_ratio also takes part in balancing
+	 * the dirty rate.  Consider the state
+	 *	pos_ratio = 0.5						     (3)
+	 *	rate = 2 * (write_bw / N)				     (4)
+	 * If (1) is used, it will get stuck in that state! Because each dd will
+	 * be throttled at
+	 *	task_ratelimit = pos_ratio * rate = (write_bw / N)	     (5)
+	 * yielding
+	 *	dirty_rate = N * task_ratelimit = write_bw		     (6)
+	 * put (6) into (1) we get
+	 *	rate_(i+1) = rate_(i)					     (7)
+	 *
+	 * So we end up using (2) to always keep
+	 *	rate_(i+1) ~= (write_bw / N)				     (8)
+	 * regardless of the value of pos_ratio. As long as (8) is satisfied,
+	 * pos_ratio is able to drive itself to 1.0, which is not only where
+	 * the dirty count meets the setpoint, but also where the slope of
+	 * pos_ratio is most flat and hence task_ratelimit is least fluctuated.
+	 */
+	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
+					   dirty_rate | 1);
+
+	bdi->dirty_ratelimit = max(balanced_dirty_ratelimit, 1UL);
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long bg_thresh,
@@ -797,6 +869,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -805,6 +878,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -814,12 +888,16 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
+					   bdi_thresh, bdi_dirty,
+					   dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 04/18] writeback: stabilize bdi->dirty_ratelimit
  2011-09-04  1:53 ` Wu Fengguang
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit-stablize --]
[-- Type: text/plain, Size: 6165 bytes --]

There are some imperfections in balanced_dirty_ratelimit.

1) large fluctuations

The dirty_rate used for computing balanced_dirty_ratelimit is merely
averaged over the past 200ms (very short compared to the 3s estimation
period for write_bw), which makes for a rather dispersed distribution of
balanced_dirty_ratelimit.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I give it up totally. (btw, the 3s
write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular
balanced_dirty_ratelimit points can be filtered out by remembering some
prev_balanced_rate and prev_prev_balanced_rate. However the more
reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematic
errors in balanced_dirty_ratelimit. The truncates, due to their possibly
bumpy nature, can hardly be compensated for smoothly. So let's face it. When
some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
high, dirty pages will go higher than the setpoint. task_ratelimit will
in turn become lower than dirty_ratelimit.  So if we consider both
balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
only when they are on the same side of dirty_ratelimit, the systematic
errors in balanced_dirty_ratelimit won't be able to bring
dirty_ratelimit far away.

The balanced_dirty_ratelimit estimation may also be inaccurate near
@limit or @freerun; however, that is less of an issue.

3) since we ultimately want to

- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint for as long as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
there is no point in bringing up dirty_ratelimit in a hurry only to hurt
both the above two goals.

So, we make use of task_ratelimit to limit the update of dirty_ratelimit
in two ways:

1) avoid changing dirty rate when it's against the position control target
   (the adjusted rate will slow down the progress of dirty pages going
   back to setpoint).

2) limit the step size. task_ratelimit changes value step by step,
   leaving a consistent trace compared to the randomly jumping
   balanced_dirty_ratelimit. task_ratelimit also has the nice property of
   smaller errors in the stable state and typically larger errors when
   there are big errors in rate.  So it's a pretty good limiting factor
   for the step size of dirty_ratelimit.

Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
task_ratelimit is merely used as a limiting factor.
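
As a rough worked example of the step limiting in the patch below
(made-up numbers), with dirty_ratelimit = 800:

        step = 200: 800 / (8*200 + 1) = 0, so 200 >> 0 = 200,
                    then (200 + 7) / 8 = 25  => track ~1/8 of the gap
        step =  10: 800 / (8*10 + 1)  = 9, so 10 >> 9 = 0,
                    then (0 + 7) / 8 = 0     => ignore small errors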

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   64 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-26 16:22:48.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 16:23:06.000000000 +0800
@@ -809,6 +809,7 @@ static void bdi_update_dirty_ratelimit(s
 	unsigned long task_ratelimit;
 	unsigned long balanced_dirty_ratelimit;
 	unsigned long pos_ratio;
+	unsigned long step;
 
 	/*
 	 * The dirty rate will match the writeout rate in long term, except
@@ -857,7 +858,68 @@ static void bdi_update_dirty_ratelimit(s
 	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
 					   dirty_rate | 1);
 
-	bdi->dirty_ratelimit = max(balanced_dirty_ratelimit, 1UL);
+	/*
+	 * We could safely do this and return immediately:
+	 *
+	 *	bdi->dirty_ratelimit = balanced_dirty_ratelimit;
+	 *
+	 * However, to get a more stable dirty_ratelimit, the more elaborate
+	 * code below makes use of task_ratelimit to filter out singular points
+	 * and limit the step size.
+	 *
+	 * The below code essentially only uses the relative value of
+	 *
+	 *	task_ratelimit - dirty_ratelimit
+	 *	= (pos_ratio - 1) * dirty_ratelimit
+	 *
+	 * which reflects the direction and size of dirty position error.
+	 */
+
+	/*
+	 * dirty_ratelimit will follow balanced_dirty_ratelimit iff
+	 * task_ratelimit is on the same side of dirty_ratelimit, too.
+	 * For example, when
+	 * - dirty_ratelimit > balanced_dirty_ratelimit
+	 * - dirty_ratelimit > task_ratelimit (dirty pages are above setpoint)
+	 * lowering dirty_ratelimit will help meet both the position and rate
+	 * control targets. Otherwise, don't update dirty_ratelimit if it will
+	 * only help meet the rate target. After all, what the users ultimately
+	 * feel and care about are a stable dirty rate and small position error.
+	 *
+	 * |task_ratelimit - dirty_ratelimit| is used to limit the step size
+	 * and filter out the singular points of balanced_dirty_ratelimit, which
+	 * keeps jumping around randomly and can even leap far away at times
+	 * due to the small 200ms estimation period of dirty_rate (we want to
+	 * keep that period small to reduce time lags).
+	 */
+	step = 0;
+	if (dirty_ratelimit < balanced_dirty_ratelimit) {
+		if (dirty_ratelimit < task_ratelimit)
+			step = min(balanced_dirty_ratelimit,
+				   task_ratelimit) - dirty_ratelimit;
+	} else {
+		if (dirty_ratelimit > task_ratelimit)
+			step = dirty_ratelimit - max(balanced_dirty_ratelimit,
+						     task_ratelimit);
+	}
+
+	/*
+	 * Don't pursue 100% rate matching. It's impossible since the balanced
+	 * rate itself is constantly fluctuating. So decrease the tracking speed
+	 * when it gets close to the target. Helps eliminate pointless tremors.
+	 */
+	step >>= dirty_ratelimit / (8 * step + 1);
+	/*
+	 * Limit the tracking speed to avoid overshooting.
+	 */
+	step = (step + 7) / 8;
+
+	if (dirty_ratelimit < balanced_dirty_ratelimit)
+		dirty_ratelimit += step;
+	else
+		dirty_ratelimit -= step;
+
+	bdi->dirty_ratelimit = max(dirty_ratelimit, 1UL);
 }
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 04/18] writeback: stabilize bdi->dirty_ratelimit
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit-stablize --]
[-- Type: text/plain, Size: 6468 bytes --]

There are some imperfections in balanced_dirty_ratelimit.

1) large fluctuations

The dirty_rate used for computing balanced_dirty_ratelimit is merely
averaged in the past 200ms (very small comparing to the 3s estimation
period for write_bw), which makes rather dispersed distribution of
balanced_dirty_ratelimit.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I give it up totally. (btw, the 3s
write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular
balanced_dirty_ratelimit points can be filtered out by remembering some
prev_balanced_rate and prev_prev_balanced_rate. However the more
reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematical
errors in balanced_dirty_ratelimit. The truncates, due to its possibly
bumpy nature, can hardly be compensated smoothly. So let's face it. When
some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
high, dirty pages will go higher than the setpoint. task_ratelimit will
in turn become lower than dirty_ratelimit.  So if we consider both
balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
only when they are on the same side of dirty_ratelimit, the systematical
errors in balanced_dirty_ratelimit won't be able to bring
dirty_ratelimit far away.

The balanced_dirty_ratelimit estimation may also be inaccurate near
@limit or @freerun, however is less an issue.

3) since we ultimately want to

- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint as long time as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
there is no point to bring up dirty_ratelimit in a hurry only to hurt
both the above two goals.

So, we make use of task_ratelimit to limit the update of dirty_ratelimit
in two ways:

1) avoid changing dirty rate when it's against the position control target
   (the adjusted rate will slow down the progress of dirty pages going
   back to setpoint).

2) limit the step size. task_ratelimit is changing values step by step,
   leaving a consistent trace comparing to the randomly jumping
   balanced_dirty_ratelimit. task_ratelimit also has the nice smaller
   errors in stable state and typically larger errors when there are big
   errors in rate.  So it's a pretty good limiting factor for the step
   size of dirty_ratelimit.

Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
task_ratelimit is merely used as a limiting factor.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   64 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-26 16:22:48.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 16:23:06.000000000 +0800
@@ -809,6 +809,7 @@ static void bdi_update_dirty_ratelimit(s
 	unsigned long task_ratelimit;
 	unsigned long balanced_dirty_ratelimit;
 	unsigned long pos_ratio;
+	unsigned long step;
 
 	/*
 	 * The dirty rate will match the writeout rate in long term, except
@@ -857,7 +858,68 @@ static void bdi_update_dirty_ratelimit(s
 	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
 					   dirty_rate | 1);
 
-	bdi->dirty_ratelimit = max(balanced_dirty_ratelimit, 1UL);
+	/*
+	 * We could safely do this and return immediately:
+	 *
+	 *	bdi->dirty_ratelimit = balanced_dirty_ratelimit;
+	 *
+	 * However to get a more stable dirty_ratelimit, the below elaborated
+	 * code makes use of task_ratelimit to filter out sigular points and
+	 * limit the step size.
+	 *
+	 * The below code essentially only uses the relative value of
+	 *
+	 *	task_ratelimit - dirty_ratelimit
+	 *	= (pos_ratio - 1) * dirty_ratelimit
+	 *
+	 * which reflects the direction and size of dirty position error.
+	 */
+
+	/*
+	 * dirty_ratelimit will follow balanced_dirty_ratelimit iff
+	 * task_ratelimit is on the same side of dirty_ratelimit, too.
+	 * For example, when
+	 * - dirty_ratelimit > balanced_dirty_ratelimit
+	 * - dirty_ratelimit > task_ratelimit (dirty pages are above setpoint)
+	 * lowering dirty_ratelimit will help meet both the position and rate
+	 * control targets. Otherwise, don't update dirty_ratelimit if it will
+	 * only help meet the rate target. After all, what the users ultimately
+	 * feel and care are stable dirty rate and small position error.
+	 *
+	 * |task_ratelimit - dirty_ratelimit| is used to limit the step size
+	 * and filter out the sigular points of balanced_dirty_ratelimit. Which
+	 * keeps jumping around randomly and can even leap far away at times
+	 * due to the small 200ms estimation period of dirty_rate (we want to
+	 * keep that period small to reduce time lags).
+	 */
+	step = 0;
+	if (dirty_ratelimit < balanced_dirty_ratelimit) {
+		if (dirty_ratelimit < task_ratelimit)
+			step = min(balanced_dirty_ratelimit,
+				   task_ratelimit) - dirty_ratelimit;
+	} else {
+		if (dirty_ratelimit > task_ratelimit)
+			step = dirty_ratelimit - max(balanced_dirty_ratelimit,
+						     task_ratelimit);
+	}
+
+	/*
+	 * Don't pursue 100% rate matching. It's impossible since the balanced
+	 * rate itself is constantly fluctuating. So decrease the track speed
+	 * when it gets close to the target. Helps eliminate pointless tremors.
+	 */
+	step >>= dirty_ratelimit / (8 * step + 1);
+	/*
+	 * Limit the tracking speed to avoid overshooting.
+	 */
+	step = (step + 7) / 8;
+
+	if (dirty_ratelimit < balanced_dirty_ratelimit)
+		dirty_ratelimit += step;
+	else
+		dirty_ratelimit -= step;
+
+	bdi->dirty_ratelimit = max(dirty_ratelimit, 1UL);
 }
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 04/18] writeback: stabilize bdi->dirty_ratelimit
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit-stablize --]
[-- Type: text/plain, Size: 6468 bytes --]

There are some imperfections in balanced_dirty_ratelimit.

1) large fluctuations

The dirty_rate used for computing balanced_dirty_ratelimit is merely
averaged in the past 200ms (very small comparing to the 3s estimation
period for write_bw), which makes rather dispersed distribution of
balanced_dirty_ratelimit.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I give it up totally. (btw, the 3s
write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular
balanced_dirty_ratelimit points can be filtered out by remembering some
prev_balanced_rate and prev_prev_balanced_rate. However the more
reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematical
errors in balanced_dirty_ratelimit. The truncates, due to its possibly
bumpy nature, can hardly be compensated smoothly. So let's face it. When
some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
high, dirty pages will go higher than the setpoint. task_ratelimit will
in turn become lower than dirty_ratelimit.  So if we consider both
balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
only when they are on the same side of dirty_ratelimit, the systematical
errors in balanced_dirty_ratelimit won't be able to bring
dirty_ratelimit far away.

The balanced_dirty_ratelimit estimation may also be inaccurate near
@limit or @freerun, however is less an issue.

3) since we ultimately want to

- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint as long time as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
there is no point to bring up dirty_ratelimit in a hurry only to hurt
both the above two goals.

So, we make use of task_ratelimit to limit the update of dirty_ratelimit
in two ways:

1) avoid changing dirty rate when it's against the position control target
   (the adjusted rate will slow down the progress of dirty pages going
   back to setpoint).

2) limit the step size. task_ratelimit is changing values step by step,
   leaving a consistent trace comparing to the randomly jumping
   balanced_dirty_ratelimit. task_ratelimit also has the nice smaller
   errors in stable state and typically larger errors when there are big
   errors in rate.  So it's a pretty good limiting factor for the step
   size of dirty_ratelimit.

Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
task_ratelimit is merely used as a limiting factor.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   64 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-26 16:22:48.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 16:23:06.000000000 +0800
@@ -809,6 +809,7 @@ static void bdi_update_dirty_ratelimit(s
 	unsigned long task_ratelimit;
 	unsigned long balanced_dirty_ratelimit;
 	unsigned long pos_ratio;
+	unsigned long step;
 
 	/*
 	 * The dirty rate will match the writeout rate in long term, except
@@ -857,7 +858,68 @@ static void bdi_update_dirty_ratelimit(s
 	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
 					   dirty_rate | 1);
 
-	bdi->dirty_ratelimit = max(balanced_dirty_ratelimit, 1UL);
+	/*
+	 * We could safely do this and return immediately:
+	 *
+	 *	bdi->dirty_ratelimit = balanced_dirty_ratelimit;
+	 *
+	 * However to get a more stable dirty_ratelimit, the below elaborated
+	 * code makes use of task_ratelimit to filter out singular points and
+	 * limit the step size.
+	 *
+	 * The below code essentially only uses the relative value of
+	 *
+	 *	task_ratelimit - dirty_ratelimit
+	 *	= (pos_ratio - 1) * dirty_ratelimit
+	 *
+	 * which reflects the direction and size of dirty position error.
+	 */
+
+	/*
+	 * dirty_ratelimit will follow balanced_dirty_ratelimit iff
+	 * task_ratelimit is on the same side of dirty_ratelimit, too.
+	 * For example, when
+	 * - dirty_ratelimit > balanced_dirty_ratelimit
+	 * - dirty_ratelimit > task_ratelimit (dirty pages are above setpoint)
+	 * lowering dirty_ratelimit will help meet both the position and rate
+	 * control targets. Otherwise, don't update dirty_ratelimit if it will
+	 * only help meet the rate target. After all, what the users ultimately
+	 * feel and care are stable dirty rate and small position error.
+	 *
+	 * |task_ratelimit - dirty_ratelimit| is used to limit the step size
+	 * and filter out the singular points of balanced_dirty_ratelimit, which
+	 * keeps jumping around randomly and can even leap far away at times
+	 * due to the small 200ms estimation period of dirty_rate (we want to
+	 * keep that period small to reduce time lags).
+	 */
+	step = 0;
+	if (dirty_ratelimit < balanced_dirty_ratelimit) {
+		if (dirty_ratelimit < task_ratelimit)
+			step = min(balanced_dirty_ratelimit,
+				   task_ratelimit) - dirty_ratelimit;
+	} else {
+		if (dirty_ratelimit > task_ratelimit)
+			step = dirty_ratelimit - max(balanced_dirty_ratelimit,
+						     task_ratelimit);
+	}
+
+	/*
+	 * Don't pursue 100% rate matching. It's impossible since the balanced
+	 * rate itself is constantly fluctuating. So decrease the track speed
+	 * when it gets close to the target. Helps eliminate pointless tremors.
+	 */
+	step >>= dirty_ratelimit / (8 * step + 1);
+	/*
+	 * Limit the tracking speed to avoid overshooting.
+	 */
+	step = (step + 7) / 8;
+
+	if (dirty_ratelimit < balanced_dirty_ratelimit)
+		dirty_ratelimit += step;
+	else
+		dirty_ratelimit -= step;
+
+	bdi->dirty_ratelimit = max(dirty_ratelimit, 1UL);
 }
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
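
To make the step damping above concrete, here is a small standalone
user-space walk-through (illustrative, made-up numbers; it is not part
of the patch and only covers the rate-increasing branch):

#include <stdio.h>

int main(void)
{
	/* made-up rates in pages/s (4k pages): ~100, ~120, ~110 MB/s */
	unsigned long dirty_ratelimit = 25600;
	unsigned long balanced_dirty_ratelimit = 30720;
	unsigned long task_ratelimit = 28160;	/* dirty pages below setpoint */
	unsigned long step = 0;

	/* update only if both targets lie above dirty_ratelimit */
	if (dirty_ratelimit < balanced_dirty_ratelimit &&
	    dirty_ratelimit < task_ratelimit)
		step = (task_ratelimit < balanced_dirty_ratelimit ?
			task_ratelimit : balanced_dirty_ratelimit) -
			dirty_ratelimit;		/* step = 2560 */

	step >>= dirty_ratelimit / (8 * step + 1);	/* 25600/20481 = 1, step = 1280 */
	step = (step + 7) / 8;				/* step = 160 pages/s per update */

	/* prints 25760, i.e. dirty_ratelimit creeps up by ~0.6 MB/s */
	printf("new dirty_ratelimit = %lu pages/s\n", dirty_ratelimit + step);
	return 0;
}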



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 05/18] writeback: per task dirty rate limit
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: per-task-ratelimit --]
[-- Type: text/plain, Size: 7441 bytes --]

Add two fields to task_struct.

1) account dirtied pages in the individual tasks, for accuracy
2) per-task balance_dirty_pages() call intervals, for flexibility

The balance_dirty_pages() call interval (i.e. nr_dirtied_pause) will
scale near-sqrt to the safety gap between the dirty page count and the
dirty threshold.
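
For a rough feel of that scaling (illustrative numbers, 4k pages, using
the dirty_poll_interval() helper added below):

    gap = thresh - dirty      ilog2(gap)   nr_dirtied_pause = 1 << (ilog2(gap) >> 1)
    256 pages     (1MB)            8       1 << 4  = 16
    16384 pages   (64MB)          14       1 << 7  = 128
    1048576 pages (4GB)           20       1 << 10 = 1024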

The main problem with per-task nr_dirtied is that if 1k+ tasks start
dirtying pages at exactly the same time, each task will be assigned a
large initial nr_dirtied_pause, so the dirty threshold will be exceeded
long before each task reaches its nr_dirtied_pause and hence calls
balance_dirty_pages().

The solution is to watch the number of pages dirtied on each CPU in
between the calls into balance_dirty_pages(). If it exceeds
ratelimit_pages (3% of the dirty threshold), force a call into
balance_dirty_pages() for a chance to set bdi->dirty_exceeded. In
normal situations, this safeguarding condition is not expected to
trigger at all (see the illustrative numbers below).
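
As an illustration of the 3% bound (made-up numbers): with dirty_thresh =
200000 pages (~800MB with 4k pages) and 8 online CPUs,
writeback_set_ratelimit() gives ratelimit_pages = 200000 / (8 * 32) = 781,
so each CPU can accumulate at most ~781 dirtied pages (~3MB) before some
task on it is forced into balance_dirty_pages(); across all CPUs that is
about 200000 / 32, i.e. ~3% of the dirty threshold.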

peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    7 +++
 kernel/fork.c         |    3 +
 mm/page-writeback.c   |   89 ++++++++++++++++++++++------------------
 3 files changed, 60 insertions(+), 39 deletions(-)

--- linux-next.orig/include/linux/sched.h	2011-08-29 19:07:56.000000000 +0800
+++ linux-next/include/linux/sched.h	2011-08-29 19:08:19.000000000 +0800
@@ -1521,6 +1521,13 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+	/*
+	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
+	 * balance_dirty_pages() for some dirty throttling pause
+	 */
+	int nr_dirtied;
+	int nr_dirtied_pause;
+
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
 	struct latency_record latency_record[LT_SAVECOUNT];
--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:08:05.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:08:19.000000000 +0800
@@ -54,20 +54,6 @@
  */
 static long ratelimit_pages = 32;
 
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
-
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
@@ -229,6 +215,8 @@ static void update_completion_period(voi
 	int shift = calc_period_shift();
 	prop_change_shift(&vm_completions, shift);
 	prop_change_shift(&vm_dirties, shift);
+
+	writeback_set_ratelimit();
 }
 
 int dirty_background_ratio_handler(struct ctl_table *table, int write,
@@ -982,6 +970,23 @@ static void bdi_update_bandwidth(struct 
 }
 
 /*
+ * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
+ * will look to see if it needs to start dirty throttling.
+ *
+ * If dirty_poll_interval is too low, big NUMA machines will call the expensive
+ * global_page_state() too often. So scale it near-sqrt to the safety margin
+ * (the number of pages we may dirty without exceeding the dirty limits).
+ */
+static unsigned long dirty_poll_interval(unsigned long dirty,
+					 unsigned long thresh)
+{
+	if (thresh > dirty)
+		return 1UL << (ilog2(thresh - dirty) >> 1);
+
+	return 1;
+}
+
+/*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -1113,6 +1118,9 @@ static void balance_dirty_pages(struct a
 	if (clear_dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
+	current->nr_dirtied = 0;
+	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
+
 	if (writeback_in_progress(bdi))
 		return;
 
@@ -1139,7 +1147,7 @@ void set_page_dirty_balance(struct page 
 	}
 }
 
-static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
+static DEFINE_PER_CPU(int, bdp_ratelimits);
 
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
@@ -1159,31 +1167,39 @@ void balance_dirty_pages_ratelimited_nr(
 					unsigned long nr_pages_dirtied)
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
-	unsigned long ratelimit;
-	unsigned long *p;
+	int ratelimit;
+	int *p;
 
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
-	ratelimit = ratelimit_pages;
-	if (mapping->backing_dev_info->dirty_exceeded)
-		ratelimit = 8;
+	ratelimit = current->nr_dirtied_pause;
+	if (bdi->dirty_exceeded)
+		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
+
+	current->nr_dirtied += nr_pages_dirtied;
 
+	preempt_disable();
 	/*
-	 * Check the rate limiting. Also, we do not want to throttle real-time
-	 * tasks in balance_dirty_pages(). Period.
+	 * This prevents one CPU to accumulate too many dirtied pages without
+	 * calling into balance_dirty_pages(), which can happen when there are
+	 * 1000+ tasks, all of them start dirtying pages at exactly the same
+	 * time, hence all honoured too large initial task->nr_dirtied_pause.
 	 */
-	preempt_disable();
 	p =  &__get_cpu_var(bdp_ratelimits);
-	*p += nr_pages_dirtied;
-	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
+	if (unlikely(current->nr_dirtied >= ratelimit))
 		*p = 0;
-		preempt_enable();
-		balance_dirty_pages(mapping, ratelimit);
-		return;
+	else {
+		*p += nr_pages_dirtied;
+		if (unlikely(*p >= ratelimit_pages)) {
+			*p = 0;
+			ratelimit = 0;
+		}
 	}
 	preempt_enable();
+
+	if (unlikely(current->nr_dirtied >= ratelimit))
+		balance_dirty_pages(mapping, current->nr_dirtied);
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
@@ -1278,22 +1294,17 @@ void laptop_sync_completion(void)
  *
  * Here we set ratelimit_pages to a level which ensures that when all CPUs are
  * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
- * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high.  Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time.  So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
+ * thresholds.
  */
 
 void writeback_set_ratelimit(void)
 {
-	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+	ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
 	if (ratelimit_pages < 16)
 		ratelimit_pages = 16;
-	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
-		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
 }
 
 static int __cpuinit
--- linux-next.orig/kernel/fork.c	2011-08-29 19:07:56.000000000 +0800
+++ linux-next/kernel/fork.c	2011-08-29 19:08:19.000000000 +0800
@@ -1329,6 +1329,9 @@ static struct task_struct *copy_process(
 	p->pdeath_signal = 0;
 	p->exit_state = 0;
 
+	p->nr_dirtied = 0;
+	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
+
 	/*
 	 * Ok, make it visible to the rest of the system.
 	 * We dont wake it up yet.



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 06/18] writeback: IO-less balance_dirty_pages()
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 15796 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. Meanwhile, kick off the per-bdi
flusher thread to do background writeback IO.

RATIONALE
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled starts foreground
  writeback, we end up with N IO submitters from at least N different
  inodes at the same time, issuing N different sets of IO with
  potentially zero locality to each other. This results in much lower
  elevator sort/merge efficiency, and hence we seek the disk all over
  the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IO
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by the current balance_dirty_pages() cannot be
  directly set to some large value (e.g. 128MB) for better IO efficiency,
  because that could lead to user-perceivable stalls of more than 1 second.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on its own couples the
  IO size to the wait time, which makes it hard to choose a suitable IO
  size while keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have had the experience: it often takes a couple of
  seconds or even longer to stop a heavily writing dd/cp/tar command
  with Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There is a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress at a constant rate.
  The current threshold-based throttling inherently transfers the large
  low-level IO completion fluctuations to bumpy application write()s, and
  further deteriorates with an increasing number of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progress is very bumpy in the legacy kernel, and its
  throughput is improved by 67% by this patchset (with the larger write
  chunk size on top, it becomes a 93% speedup).

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and I explored a scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may complete a large number of unstable pages with one single
  COMMIT. Because the NFS server serves COMMIT with expensive fsync()
  IOs, it is desirable to delay and reduce the number of COMMITs. So
  it's not likely that such bursty IO completions can be optimized
  away, nor the resulting large (and tiny) stall times in IO completion
  based throttling.

So here is a pause time oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation by controlling the
number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in the 1-dd case, and increase to
~100ms in the 1000-dd case (a rough worked example of the pause
computation follows below).
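
As a rough worked example (illustrative numbers, 4k pages): a single dd
writing to a 50MB/s disk sees task_ratelimit ~= 12800 pages/s, so
dirtying pages_dirtied = 128 pages between calls yields
pause = HZ * 128 / 12800 = HZ / 100, i.e. ~10ms. With 1000 dd's sharing
the same disk, each task's ratelimit drops to ~12.8 pages/s, and the
pause control shrinks pages_dirtied towards a single page, which still
yields pauses in the ~100ms range (HZ * 1 / 12.8 ~= 80ms).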

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that their applications get throttled once they
cross the global (background + dirty)/2 = 15% threshold (with the
default 10% background and 20% dirty ratios), and are then balanced
around 17.5%. Before this patch, the behavior was to simply throttle
at 20% of dirtyable memory in the 1-dd case.

Since tasks will be soft throttled earlier than before, this may be
perceived by end users as a performance "slow down" if their
applications happen to dirty more than 15% of dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  161 ++++++++++-------------------
 2 files changed, 56 insertions(+), 129 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-09-01 09:38:46.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-01 09:58:05.000000000 +0800
@@ -309,50 +309,6 @@ static void bdi_writeout_fraction(struct
 				numerator, denominator);
 }
 
-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -989,29 +945,35 @@ static unsigned long dirty_poll_interval
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
+	unsigned long bdi_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long task_ratelimit;
+	unsigned long dirty_ratelimit;
+	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -1027,9 +989,23 @@ static void balance_dirty_pages(struct a
 						      background_thresh))
 			break;
 
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		/*
+		 * bdi_thresh is not treated as a hard limiting factor like
+		 * dirty_thresh, for the following reasons:
+		 * - in JBOD setup, bdi_thresh can fluctuate a lot
+		 * - in a system with HDD and USB key, the USB key may somehow
+		 *   go into state (bdi_dirty >> bdi_thresh) either because
+		 *   bdi_dirty starts high, or because bdi_thresh drops low.
+		 *   In this case we don't want to hard throttle the USB key
+		 *   dirtiers for 100 seconds until bdi_dirty drops under
+		 *   bdi_thresh. Instead the auxiliary bdi control line in
+		 *   bdi_position_ratio() will let the dirtier task progress
+		 *   at some rate <= (write_bw / 2) for bringing down bdi_dirty.
+		 */
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -1041,57 +1017,41 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi)) {
+			bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+			bdi_dirty = bdi_reclaimable +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
 		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+			bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+			bdi_dirty = bdi_reclaimable +
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		dirty_ratelimit = bdi->dirty_ratelimit;
+		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
+					       background_thresh, nr_dirty,
+					       bdi_thresh, bdi_dirty);
+		if (unlikely(pos_ratio == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		task_ratelimit = (u64)dirty_ratelimit *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		pause = min_t(long, pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -1100,22 +1060,11 @@ static void balance_dirty_pages(struct a
 		 * 200ms is typically more than enough to curb heavy dirtiers;
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
-		if (nr_dirty < dirty_thresh &&
-		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+		if (nr_dirty < dirty_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
@@ -1132,8 +1081,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-09-01 09:38:46.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-09-01 09:56:58.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 06/18] writeback: IO-less balance_dirty_pages()
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 16099 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

RATIONALS
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled start foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency.
  Because it could lead to more than 1 second user perceivable stalls.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on itself couples the
  IO size to wait time, which makes it hard to do suitable IO size while
  keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have the experience: it often takes a couple of seconds
  or even long time to stop a heavy writing dd/cp/tar command with
  Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There are a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress in a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and further deteriorates with increasing number of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpy in legacy kernel, and throughput is
  improved by 67% by this patchset. (plus the larger write chunk size,
  it will be 93% speedup).

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and me explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it's found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may kill large amount of unstable pages with one single COMMIT.
  Because NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely to optimize away such kind of bursty IO completions, and the
  resulted large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocations, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
in 1000-dd case.

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that the applications will get throttled once crossing
the global (background + dirty)/2=15% threshold, and then balanced around
17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
memory in 1-dd case.

Since the task will be soft throttled earlier than before, it may be
perceived by end users as performance "slow down" if his application
happens to dirty more than 15% dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  161 ++++++++++-------------------
 2 files changed, 56 insertions(+), 129 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-09-01 09:38:46.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-01 09:58:05.000000000 +0800
@@ -309,50 +309,6 @@ static void bdi_writeout_fraction(struct
 				numerator, denominator);
 }
 
-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -989,29 +945,35 @@ static unsigned long dirty_poll_interval
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
+	unsigned long bdi_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long task_ratelimit;
+	unsigned long dirty_ratelimit;
+	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -1027,9 +989,23 @@ static void balance_dirty_pages(struct a
 						      background_thresh))
 			break;
 
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		/*
+		 * bdi_thresh is not treated as some limiting factor as
+		 * dirty_thresh, due to reasons
+		 * - in JBOD setup, bdi_thresh can fluctuate a lot
+		 * - in a system with HDD and USB key, the USB key may somehow
+		 *   go into state (bdi_dirty >> bdi_thresh) either because
+		 *   bdi_dirty starts high, or because bdi_thresh drops low.
+		 *   In this case we don't want to hard throttle the USB key
+		 *   dirtiers for 100 seconds until bdi_dirty drops under
+		 *   bdi_thresh. Instead the auxiliary bdi control line in
+		 *   bdi_position_ratio() will let the dirtier task progress
+		 *   at some rate <= (write_bw / 2) for bringing down bdi_dirty.
+		 */
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -1041,57 +1017,41 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi)) {
+			bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+			bdi_dirty = bdi_reclaimable +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
 		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+			bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+			bdi_dirty = bdi_reclaimable +
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		dirty_ratelimit = bdi->dirty_ratelimit;
+		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
+					       background_thresh, nr_dirty,
+					       bdi_thresh, bdi_dirty);
+		if (unlikely(pos_ratio == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		task_ratelimit = (u64)dirty_ratelimit *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		pause = min_t(long, pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -1100,22 +1060,11 @@ static void balance_dirty_pages(struct a
 		 * 200ms is typically more than enough to curb heavy dirtiers;
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
-		if (nr_dirty < dirty_thresh &&
-		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+		if (nr_dirty < dirty_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
@@ -1132,8 +1081,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-09-01 09:38:46.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-09-01 09:56:58.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 06/18] writeback: IO-less balance_dirty_pages()
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 16099 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let the dirtying task idle
sleep for some time to throttle it. Meanwhile, kick off the per-bdi
flusher thread to do background writeback IO.

RATIONALE
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled starts foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, ending up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence the disk seeking
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by the current balance_dirty_pages() cannot be
  directly set to some large value (e.g. 128MB) for better IO efficiency,
  because that could lead to user-perceivable stalls of more than 1 second.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO by itself couples the
  IO size to the wait time, which makes it hard to pick a suitable IO
  size while keeping the wait time under control.

  Now it's possible to increase the writeback chunk size in proportion to
  the disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.
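
  As a rough illustration only (the exact chunk sizing policy is
  introduced in a separate patch, so the numbers here are assumptions):
  if the chunk were sized to about half a second's worth of writeout,
  a 60MB/s disk would get a ~30MB chunk while a 6MB/s USB key would get
  ~3MB, keeping the per-chunk wait time roughly constant across devices
  instead of using a fixed 4MB for both.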

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have had the experience: it often takes a couple of
  seconds or even longer to stop a heavy-writing dd/cp/tar command with
  Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There is a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress at a constant rate.
  The current threshold-based throttling inherently transfers the large
  low-level IO completion fluctuations into bumpy application write()s,
  and this further deteriorates with an increasing number of dirtiers
  and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpily in the legacy kernel, and its
  throughput is improved by 67% by this patchset (with the larger write
  chunk size, the speedup reaches 93%).

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less, low-latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and I explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait times and jitter.

- NFS may kill a large number of unstable pages with one single COMMIT.
  Because the NFS server serves COMMITs with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely that such bursty IO completions can be optimized away, nor the
  resulting large (and tiny) stall times in IO-completion-based throttling.

So here is a pause-time-oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a follow-up
patch) will be to do ~10ms pauses in the 1-dd case, and increase to
~100ms in the 1000-dd case.
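
As a worked example (with purely illustrative numbers), the pause follows
directly from the formula used in the patch below,
pause = HZ * pages_dirtied / task_ratelimit: a task allowed
task_ratelimit = 25600 pages/s (~100MB/s with 4KB pages) that enters
balance_dirty_pages() after dirtying 256 pages will sleep
HZ * 256 / 25600 = HZ/100 jiffies, i.e. about 10ms. Longer or shorter
pauses are then obtained by scaling the number of pages a task may dirty
between calls (nr_dirtied_pause), not by changing the formula.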

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that the applications will get throttled once they cross
the global (background + dirty)/2 = 15% threshold, and will then be balanced
around 17.5%. Before this patch, the behavior was to simply throttle at 20%
of dirtyable memory in the 1-dd case.
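
As a concrete calculation, assuming the default vm.dirty_background_ratio
of 10% and vm.dirty_ratio of 20% (assumptions made only to ground the
numbers above): the freerun ceiling is (10% + 20%) / 2 = 15% of dirtyable
memory, and the setpoint that tasks are balanced around lies halfway
between that ceiling and the 20% limit, i.e. (15% + 20%) / 2 = 17.5%.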

Since the task will be soft throttled earlier than before, it may be
perceived by end users as a performance "slowdown" if their application
happens to dirty more than 15% of dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  161 ++++++++++-------------------
 2 files changed, 56 insertions(+), 129 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-09-01 09:38:46.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-01 09:58:05.000000000 +0800
@@ -309,50 +309,6 @@ static void bdi_writeout_fraction(struct
 				numerator, denominator);
 }
 
-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -989,29 +945,35 @@ static unsigned long dirty_poll_interval
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
+	unsigned long bdi_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long task_ratelimit;
+	unsigned long dirty_ratelimit;
+	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -1027,9 +989,23 @@ static void balance_dirty_pages(struct a
 						      background_thresh))
 			break;
 
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		/*
+		 * bdi_thresh is not treated as some limiting factor as
+		 * dirty_thresh, due to reasons
+		 * - in JBOD setup, bdi_thresh can fluctuate a lot
+		 * - in a system with HDD and USB key, the USB key may somehow
+		 *   go into state (bdi_dirty >> bdi_thresh) either because
+		 *   bdi_dirty starts high, or because bdi_thresh drops low.
+		 *   In this case we don't want to hard throttle the USB key
+		 *   dirtiers for 100 seconds until bdi_dirty drops under
+		 *   bdi_thresh. Instead the auxiliary bdi control line in
+		 *   bdi_position_ratio() will let the dirtier task progress
+		 *   at some rate <= (write_bw / 2) for bringing down bdi_dirty.
+		 */
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -1041,57 +1017,41 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi)) {
+			bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+			bdi_dirty = bdi_reclaimable +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
 		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+			bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+			bdi_dirty = bdi_reclaimable +
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		dirty_ratelimit = bdi->dirty_ratelimit;
+		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
+					       background_thresh, nr_dirty,
+					       bdi_thresh, bdi_dirty);
+		if (unlikely(pos_ratio == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		task_ratelimit = (u64)dirty_ratelimit *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		pause = min_t(long, pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -1100,22 +1060,11 @@ static void balance_dirty_pages(struct a
 		 * 200ms is typically more than enough to curb heavy dirtiers;
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
-		if (nr_dirty < dirty_thresh &&
-		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+		if (nr_dirty < dirty_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
@@ -1132,8 +1081,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-09-01 09:38:46.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-09-01 09:56:58.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 07/18] writeback: dirty ratelimit - think time compensation
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: think-time-compensation --]
[-- Type: text/plain, Size: 5265 bytes --]

Compensate the task's think time when computing the final pause time,
so that ->dirty_ratelimit can be executed accurately.

        think time := time spent outside of balance_dirty_pages()

In the rare case that the task slept longer than the 200ms period time
(resulting in a negative pause time), the sleep time will be compensated in
the following periods, too, if it's less than 1 second.

Accumulated errors are carefully avoided as long as the max pause area
is not hit.

Pseudo code:

        period = pages_dirtied / task_ratelimit;
        think = jiffies - dirty_paused_when;
        pause = period - think;

1) normal case: period > think

        pause = period - think
        dirty_paused_when = jiffies + pause
        nr_dirtied = 0

                             period time
              |===============================>|
                  think time      pause time
              |===============>|==============>|
        ------|----------------|---------------|------------------------
        dirty_paused_when   jiffies


2) no pause case: period <= think

        don't pause; reduce future pause time by:
        dirty_paused_when += period
        nr_dirtied = 0

                           period time
              |===============================>|
                                  think time
              |===================================================>|
        ------|--------------------------------+-------------------|----
        dirty_paused_when                                       jiffies
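
A small worked example with illustrative numbers: with
task_ratelimit = 25600 pages/s and pages_dirtied = 256 pages,
period = HZ * 256 / 25600 = 10ms. If the task spent think = 4ms outside
balance_dirty_pages() since dirty_paused_when, it only sleeps
pause = 10ms - 4ms = 6ms (case 1). If instead it was blocked in the
filesystem for think = 300ms, then period <= think and it does not sleep
at all; dirty_paused_when is advanced by the 10ms period, so the surplus
think time is credited against future pauses (case 2).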

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    1 +
 kernel/fork.c         |    1 +
 mm/page-writeback.c   |   34 +++++++++++++++++++++++++++++++---
 3 files changed, 33 insertions(+), 3 deletions(-)

--- linux-next.orig/include/linux/sched.h	2011-08-26 20:09:04.000000000 +0800
+++ linux-next/include/linux/sched.h	2011-08-26 20:09:19.000000000 +0800
@@ -1527,6 +1527,7 @@ struct task_struct {
 	 */
 	int nr_dirtied;
 	int nr_dirtied_pause;
+	unsigned long dirty_paused_when; /* start of a write-and-pause period */
 
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
--- linux-next.orig/mm/page-writeback.c	2011-08-26 20:09:19.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 20:09:19.000000000 +0800
@@ -958,6 +958,7 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	long period;
 	long pause = 0;
 	bool dirty_exceeded = false;
 	unsigned long task_ratelimit;
@@ -967,6 +968,8 @@ static void balance_dirty_pages(struct a
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		unsigned long now = jiffies;
+
 		/*
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been
@@ -985,8 +988,11 @@ static void balance_dirty_pages(struct a
 		 * when the bdi limits are ramping up.
 		 */
 		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
-						      background_thresh))
+						      background_thresh)) {
+			current->dirty_paused_when = now;
+			current->nr_dirtied = 0;
 			break;
+		}
 
 		if (unlikely(!writeback_in_progress(bdi)))
 			bdi_start_background_writeback(bdi);
@@ -1037,18 +1043,41 @@ static void balance_dirty_pages(struct a
 					       background_thresh, nr_dirty,
 					       bdi_thresh, bdi_dirty);
 		if (unlikely(pos_ratio == 0)) {
+			period = MAX_PAUSE;
 			pause = MAX_PAUSE;
 			goto pause;
 		}
 		task_ratelimit = (u64)dirty_ratelimit *
 					pos_ratio >> RATELIMIT_CALC_SHIFT;
-		pause = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		period = (HZ * pages_dirtied) / (task_ratelimit | 1);
+		pause = current->dirty_paused_when + period - now;
+		/*
+		 * For less than 1s think time (ext3/4 may block the dirtier
+		 * for up to 800ms from time to time on 1-HDD; so does xfs,
+		 * however at much less frequency), try to compensate it in
+		 * future periods by updating the virtual time; otherwise just
+		 * do a reset, as it may be a light dirtier.
+		 */
+		if (unlikely(pause <= 0)) {
+			if (pause < -HZ) {
+				current->dirty_paused_when = now;
+				current->nr_dirtied = 0;
+			} else if (period) {
+				current->dirty_paused_when += period;
+				current->nr_dirtied = 0;
+			}
+			pause = 1; /* avoid resetting nr_dirtied_pause below */
+			break;
+		}
 		pause = min_t(long, pause, MAX_PAUSE);
 
 pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
 
+		current->dirty_paused_when = now + pause;
+		current->nr_dirtied = 0;
+
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
 		 * max-pause area. If dirty exceeded but still within this
@@ -1063,7 +1092,6 @@ pause:
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
-	current->nr_dirtied = 0;
 	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
 
 	if (writeback_in_progress(bdi))
--- linux-next.orig/kernel/fork.c	2011-08-26 20:09:04.000000000 +0800
+++ linux-next/kernel/fork.c	2011-08-26 20:09:19.000000000 +0800
@@ -1331,6 +1331,7 @@ static struct task_struct *copy_process(
 
 	p->nr_dirtied = 0;
 	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
+	p->dirty_paused_when = 0;
 
 	/*
 	 * Ok, make it visible to the rest of the system.



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 08/18] writeback: trace dirty_ratelimit
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-trace-throttle-bandwidth.patch --]
[-- Type: text/plain, Size: 2633 bytes --]

This helps in understanding how the various throttle bandwidths are updated.
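
Once applied, the event can be enabled through the standard ftrace
interface, e.g. by writing 1 to
/sys/kernel/debug/tracing/events/writeback/dirty_ratelimit/enable. Each
ratelimit update then emits a line following the TP_printk format in the
patch below; with purely illustrative values (and wrapped here for
readability) it might look like:

  bdi 8:0: write_bw=112380 awrite_bw=110892 dirty_rate=108400
  dirty_ratelimit=108200 task_ratelimit=105600 balanced_dirty_ratelimit=107900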

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   46 +++++++++++++++++++++++++++++
 mm/page-writeback.c              |    3 +
 2 files changed, 49 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:51:30.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:52:11.000000000 +0800
@@ -864,6 +864,9 @@ static void bdi_update_dirty_ratelimit(s
 		dirty_ratelimit -= step;
 
 	bdi->dirty_ratelimit = max(dirty_ratelimit, 1UL);
+
+	trace_dirty_ratelimit(bdi, dirty_rate, task_ratelimit,
+			      balanced_dirty_ratelimit);
 }
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
--- linux-next.orig/include/trace/events/writeback.h	2011-08-29 19:51:30.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-29 19:52:11.000000000 +0800
@@ -226,6 +226,52 @@ TRACE_EVENT(global_dirty_state,
 	)
 );
 
+#define KBps(x)			((x) << (PAGE_SHIFT - 10))
+
+TRACE_EVENT(dirty_ratelimit,
+
+	TP_PROTO(struct backing_dev_info *bdi,
+		 unsigned long dirty_rate,
+		 unsigned long task_ratelimit,
+		 unsigned long balanced_dirty_ratelimit),
+
+	TP_ARGS(bdi, dirty_rate, task_ratelimit, balanced_dirty_ratelimit),
+
+	TP_STRUCT__entry(
+		__array(char,		bdi, 32)
+		__field(unsigned long,	write_bw)
+		__field(unsigned long,	avg_write_bw)
+		__field(unsigned long,	dirty_rate)
+		__field(unsigned long,	dirty_ratelimit)
+		__field(unsigned long,	task_ratelimit)
+		__field(unsigned long,	balanced_dirty_ratelimit)
+	),
+
+	TP_fast_assign(
+		strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
+		__entry->write_bw	= KBps(bdi->write_bandwidth);
+		__entry->avg_write_bw	= KBps(bdi->avg_write_bandwidth);
+		__entry->dirty_rate	= KBps(dirty_rate);
+		__entry->dirty_ratelimit = KBps(bdi->dirty_ratelimit);
+		__entry->task_ratelimit	= KBps(task_ratelimit);
+		__entry->balanced_dirty_ratelimit =
+					  KBps(balanced_dirty_ratelimit);
+	),
+
+	TP_printk("bdi %s: "
+		  "write_bw=%lu awrite_bw=%lu dirty_rate=%lu "
+		  "dirty_ratelimit=%lu task_ratelimit=%lu "
+		  "balanced_dirty_ratelimit=%lu",
+		  __entry->bdi,
+		  __entry->write_bw,		/* write bandwidth */
+		  __entry->avg_write_bw,	/* avg write bandwidth */
+		  __entry->dirty_rate,		/* bdi dirty rate */
+		  __entry->dirty_ratelimit,	/* base ratelimit */
+		  __entry->task_ratelimit, /* ratelimit with position control */
+		  __entry->balanced_dirty_ratelimit /* the balanced ratelimit */
+	)
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 09/18] writeback: trace balance_dirty_pages
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-trace-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 4286 bytes --]

Useful for analyzing the dynamics of the throttling algorithms and for
debugging user-reported problems.
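
All time-related fields (period, think, pause, paused) are converted to
milliseconds by the TP_fast_assign code in the patch below, so a line
reporting, say, period=40 think=15 pause=25 (illustrative values)
directly reflects the think time compensation of the previous patch:
pause = period - think.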

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   81 +++++++++++++++++++++++++++++
 mm/page-writeback.c              |   24 ++++++++
 2 files changed, 105 insertions(+)

--- linux-next.orig/include/trace/events/writeback.h	2011-08-29 19:52:11.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-29 19:52:13.000000000 +0800
@@ -272,6 +272,87 @@ TRACE_EVENT(dirty_ratelimit,
 	)
 );
 
+TRACE_EVENT(balance_dirty_pages,
+
+	TP_PROTO(struct backing_dev_info *bdi,
+		 unsigned long thresh,
+		 unsigned long bg_thresh,
+		 unsigned long dirty,
+		 unsigned long bdi_thresh,
+		 unsigned long bdi_dirty,
+		 unsigned long dirty_ratelimit,
+		 unsigned long task_ratelimit,
+		 unsigned long dirtied,
+		 unsigned long period,
+		 long pause,
+		 unsigned long start_time),
+
+	TP_ARGS(bdi, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty,
+		dirty_ratelimit, task_ratelimit,
+		dirtied, period, pause, start_time),
+
+	TP_STRUCT__entry(
+		__array(	 char,	bdi, 32)
+		__field(unsigned long,	limit)
+		__field(unsigned long,	setpoint)
+		__field(unsigned long,	dirty)
+		__field(unsigned long,	bdi_setpoint)
+		__field(unsigned long,	bdi_dirty)
+		__field(unsigned long,	dirty_ratelimit)
+		__field(unsigned long,	task_ratelimit)
+		__field(unsigned int,	dirtied)
+		__field(unsigned int,	dirtied_pause)
+		__field(unsigned long,	period)
+		__field(	 long,	think)
+		__field(	 long,	pause)
+		__field(unsigned long,	paused)
+	),
+
+	TP_fast_assign(
+		unsigned long freerun = (thresh + bg_thresh) / 2;
+		strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
+
+		__entry->limit		= global_dirty_limit;
+		__entry->setpoint	= (global_dirty_limit + freerun) / 2;
+		__entry->dirty		= dirty;
+		__entry->bdi_setpoint	= __entry->setpoint *
+						bdi_thresh / (thresh + 1);
+		__entry->bdi_dirty	= bdi_dirty;
+		__entry->dirty_ratelimit = KBps(dirty_ratelimit);
+		__entry->task_ratelimit	= KBps(task_ratelimit);
+		__entry->dirtied	= dirtied;
+		__entry->dirtied_pause	= current->nr_dirtied_pause;
+		__entry->think		= current->dirty_paused_when == 0 ? 0 :
+			 (long)(jiffies - current->dirty_paused_when) * 1000/HZ;
+		__entry->period		= period * 1000 / HZ;
+		__entry->pause		= pause * 1000 / HZ;
+		__entry->paused		= (jiffies - start_time) * 1000 / HZ;
+	),
+
+
+	TP_printk("bdi %s: "
+		  "limit=%lu setpoint=%lu dirty=%lu "
+		  "bdi_setpoint=%lu bdi_dirty=%lu "
+		  "dirty_ratelimit=%lu task_ratelimit=%lu "
+		  "dirtied=%u dirtied_pause=%u "
+		  "period=%lu think=%ld pause=%ld paused=%lu",
+		  __entry->bdi,
+		  __entry->limit,
+		  __entry->setpoint,
+		  __entry->dirty,
+		  __entry->bdi_setpoint,
+		  __entry->bdi_dirty,
+		  __entry->dirty_ratelimit,
+		  __entry->task_ratelimit,
+		  __entry->dirtied,
+		  __entry->dirtied_pause,
+		  __entry->period,	/* ms */
+		  __entry->think,	/* ms */
+		  __entry->pause,	/* ms */
+		  __entry->paused	/* ms */
+	  )
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:52:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:52:13.000000000 +0800
@@ -1062,6 +1062,18 @@ static void balance_dirty_pages(struct a
 		 * do a reset, as it may be a light dirtier.
 		 */
 		if (unlikely(pause <= 0)) {
+			trace_balance_dirty_pages(bdi,
+						  dirty_thresh,
+						  background_thresh,
+						  nr_dirty,
+						  bdi_thresh,
+						  bdi_dirty,
+						  dirty_ratelimit,
+						  task_ratelimit,
+						  pages_dirtied,
+						  period,
+						  pause,
+						  start_time);
 			if (pause < -HZ) {
 				current->dirty_paused_when = now;
 				current->nr_dirtied = 0;
@@ -1075,6 +1087,18 @@ static void balance_dirty_pages(struct a
 		pause = min(pause, (long)MAX_PAUSE);
 
 pause:
+		trace_balance_dirty_pages(bdi,
+					  dirty_thresh,
+					  background_thresh,
+					  nr_dirty,
+					  bdi_thresh,
+					  bdi_dirty,
+					  dirty_ratelimit,
+					  task_ratelimit,
+					  pages_dirtied,
+					  period,
+					  pause,
+					  start_time);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
 



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 09/18] writeback: trace balance_dirty_pages
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-trace-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 4589 bytes --]

Useful for analyzing the dynamics of the throttling algorithms and
debugging user reported problems.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   81 +++++++++++++++++++++++++++++
 mm/page-writeback.c              |   24 ++++++++
 2 files changed, 105 insertions(+)

--- linux-next.orig/include/trace/events/writeback.h	2011-08-29 19:52:11.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-29 19:52:13.000000000 +0800
@@ -272,6 +272,87 @@ TRACE_EVENT(dirty_ratelimit,
 	)
 );
 
+TRACE_EVENT(balance_dirty_pages,
+
+	TP_PROTO(struct backing_dev_info *bdi,
+		 unsigned long thresh,
+		 unsigned long bg_thresh,
+		 unsigned long dirty,
+		 unsigned long bdi_thresh,
+		 unsigned long bdi_dirty,
+		 unsigned long dirty_ratelimit,
+		 unsigned long task_ratelimit,
+		 unsigned long dirtied,
+		 unsigned long period,
+		 long pause,
+		 unsigned long start_time),
+
+	TP_ARGS(bdi, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty,
+		dirty_ratelimit, task_ratelimit,
+		dirtied, period, pause, start_time),
+
+	TP_STRUCT__entry(
+		__array(	 char,	bdi, 32)
+		__field(unsigned long,	limit)
+		__field(unsigned long,	setpoint)
+		__field(unsigned long,	dirty)
+		__field(unsigned long,	bdi_setpoint)
+		__field(unsigned long,	bdi_dirty)
+		__field(unsigned long,	dirty_ratelimit)
+		__field(unsigned long,	task_ratelimit)
+		__field(unsigned int,	dirtied)
+		__field(unsigned int,	dirtied_pause)
+		__field(unsigned long,	period)
+		__field(	 long,	think)
+		__field(	 long,	pause)
+		__field(unsigned long,	paused)
+	),
+
+	TP_fast_assign(
+		unsigned long freerun = (thresh + bg_thresh) / 2;
+		strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
+
+		__entry->limit		= global_dirty_limit;
+		__entry->setpoint	= (global_dirty_limit + freerun) / 2;
+		__entry->dirty		= dirty;
+		__entry->bdi_setpoint	= __entry->setpoint *
+						bdi_thresh / (thresh + 1);
+		__entry->bdi_dirty	= bdi_dirty;
+		__entry->dirty_ratelimit = KBps(dirty_ratelimit);
+		__entry->task_ratelimit	= KBps(task_ratelimit);
+		__entry->dirtied	= dirtied;
+		__entry->dirtied_pause	= current->nr_dirtied_pause;
+		__entry->think		= current->dirty_paused_when == 0 ? 0 :
+			 (long)(jiffies - current->dirty_paused_when) * 1000/HZ;
+		__entry->period		= period * 1000 / HZ;
+		__entry->pause		= pause * 1000 / HZ;
+		__entry->paused		= (jiffies - start_time) * 1000 / HZ;
+	),
+
+
+	TP_printk("bdi %s: "
+		  "limit=%lu setpoint=%lu dirty=%lu "
+		  "bdi_setpoint=%lu bdi_dirty=%lu "
+		  "dirty_ratelimit=%lu task_ratelimit=%lu "
+		  "dirtied=%u dirtied_pause=%u "
+		  "period=%lu think=%ld pause=%ld paused=%lu",
+		  __entry->bdi,
+		  __entry->limit,
+		  __entry->setpoint,
+		  __entry->dirty,
+		  __entry->bdi_setpoint,
+		  __entry->bdi_dirty,
+		  __entry->dirty_ratelimit,
+		  __entry->task_ratelimit,
+		  __entry->dirtied,
+		  __entry->dirtied_pause,
+		  __entry->period,	/* ms */
+		  __entry->think,	/* ms */
+		  __entry->pause,	/* ms */
+		  __entry->paused	/* ms */
+	  )
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:52:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:52:13.000000000 +0800
@@ -1062,6 +1062,18 @@ static void balance_dirty_pages(struct a
 		 * do a reset, as it may be a light dirtier.
 		 */
 		if (unlikely(pause <= 0)) {
+			trace_balance_dirty_pages(bdi,
+						  dirty_thresh,
+						  background_thresh,
+						  nr_dirty,
+						  bdi_thresh,
+						  bdi_dirty,
+						  dirty_ratelimit,
+						  task_ratelimit,
+						  pages_dirtied,
+						  period,
+						  pause,
+						  start_time);
 			if (pause < -HZ) {
 				current->dirty_paused_when = now;
 				current->nr_dirtied = 0;
@@ -1075,6 +1087,18 @@ static void balance_dirty_pages(struct a
 		pause = min(pause, (long)MAX_PAUSE);
 
 pause:
+		trace_balance_dirty_pages(bdi,
+					  dirty_thresh,
+					  background_thresh,
+					  nr_dirty,
+					  bdi_thresh,
+					  bdi_dirty,
+					  dirty_ratelimit,
+					  task_ratelimit,
+					  pages_dirtied,
+					  period,
+					  pause,
+					  start_time);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
 



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 09/18] writeback: trace balance_dirty_pages
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-trace-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 4589 bytes --]

Useful for analyzing the dynamics of the throttling algorithms and for
debugging user-reported problems.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   81 +++++++++++++++++++++++++++++
 mm/page-writeback.c              |   24 ++++++++
 2 files changed, 105 insertions(+)

--- linux-next.orig/include/trace/events/writeback.h	2011-08-29 19:52:11.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-29 19:52:13.000000000 +0800
@@ -272,6 +272,87 @@ TRACE_EVENT(dirty_ratelimit,
 	)
 );
 
+TRACE_EVENT(balance_dirty_pages,
+
+	TP_PROTO(struct backing_dev_info *bdi,
+		 unsigned long thresh,
+		 unsigned long bg_thresh,
+		 unsigned long dirty,
+		 unsigned long bdi_thresh,
+		 unsigned long bdi_dirty,
+		 unsigned long dirty_ratelimit,
+		 unsigned long task_ratelimit,
+		 unsigned long dirtied,
+		 unsigned long period,
+		 long pause,
+		 unsigned long start_time),
+
+	TP_ARGS(bdi, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty,
+		dirty_ratelimit, task_ratelimit,
+		dirtied, period, pause, start_time),
+
+	TP_STRUCT__entry(
+		__array(	 char,	bdi, 32)
+		__field(unsigned long,	limit)
+		__field(unsigned long,	setpoint)
+		__field(unsigned long,	dirty)
+		__field(unsigned long,	bdi_setpoint)
+		__field(unsigned long,	bdi_dirty)
+		__field(unsigned long,	dirty_ratelimit)
+		__field(unsigned long,	task_ratelimit)
+		__field(unsigned int,	dirtied)
+		__field(unsigned int,	dirtied_pause)
+		__field(unsigned long,	period)
+		__field(	 long,	think)
+		__field(	 long,	pause)
+		__field(unsigned long,	paused)
+	),
+
+	TP_fast_assign(
+		unsigned long freerun = (thresh + bg_thresh) / 2;
+		strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
+
+		__entry->limit		= global_dirty_limit;
+		__entry->setpoint	= (global_dirty_limit + freerun) / 2;
+		__entry->dirty		= dirty;
+		__entry->bdi_setpoint	= __entry->setpoint *
+						bdi_thresh / (thresh + 1);
+		__entry->bdi_dirty	= bdi_dirty;
+		__entry->dirty_ratelimit = KBps(dirty_ratelimit);
+		__entry->task_ratelimit	= KBps(task_ratelimit);
+		__entry->dirtied	= dirtied;
+		__entry->dirtied_pause	= current->nr_dirtied_pause;
+		__entry->think		= current->dirty_paused_when == 0 ? 0 :
+			 (long)(jiffies - current->dirty_paused_when) * 1000/HZ;
+		__entry->period		= period * 1000 / HZ;
+		__entry->pause		= pause * 1000 / HZ;
+		__entry->paused		= (jiffies - start_time) * 1000 / HZ;
+	),
+
+
+	TP_printk("bdi %s: "
+		  "limit=%lu setpoint=%lu dirty=%lu "
+		  "bdi_setpoint=%lu bdi_dirty=%lu "
+		  "dirty_ratelimit=%lu task_ratelimit=%lu "
+		  "dirtied=%u dirtied_pause=%u "
+		  "period=%lu think=%ld pause=%ld paused=%lu",
+		  __entry->bdi,
+		  __entry->limit,
+		  __entry->setpoint,
+		  __entry->dirty,
+		  __entry->bdi_setpoint,
+		  __entry->bdi_dirty,
+		  __entry->dirty_ratelimit,
+		  __entry->task_ratelimit,
+		  __entry->dirtied,
+		  __entry->dirtied_pause,
+		  __entry->period,	/* ms */
+		  __entry->think,	/* ms */
+		  __entry->pause,	/* ms */
+		  __entry->paused	/* ms */
+	  )
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:52:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:52:13.000000000 +0800
@@ -1062,6 +1062,18 @@ static void balance_dirty_pages(struct a
 		 * do a reset, as it may be a light dirtier.
 		 */
 		if (unlikely(pause <= 0)) {
+			trace_balance_dirty_pages(bdi,
+						  dirty_thresh,
+						  background_thresh,
+						  nr_dirty,
+						  bdi_thresh,
+						  bdi_dirty,
+						  dirty_ratelimit,
+						  task_ratelimit,
+						  pages_dirtied,
+						  period,
+						  pause,
+						  start_time);
 			if (pause < -HZ) {
 				current->dirty_paused_when = now;
 				current->nr_dirtied = 0;
@@ -1075,6 +1087,18 @@ static void balance_dirty_pages(struct a
 		pause = min(pause, (long)MAX_PAUSE);
 
 pause:
+		trace_balance_dirty_pages(bdi,
+					  dirty_thresh,
+					  background_thresh,
+					  nr_dirty,
+					  bdi_thresh,
+					  bdi_dirty,
+					  dirty_ratelimit,
+					  task_ratelimit,
+					  pages_dirtied,
+					  period,
+					  pause,
+					  start_time);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
 



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-04  1:53 ` Wu Fengguang
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: bdi-reserve-area --]
[-- Type: text/plain, Size: 2573 bytes --]

Keep a minimal pool of dirty pages for each bdi, so that the disk IO
queues won't underrun.

It's particularly useful for JBOD and small-memory systems.

Note that this is not enough when memory is really tight (in comparison
to the write bandwidth). It may result in (pos_ratio > 1) at the setpoint
and push the dirty pages high. This is more or less intended because the
bdi is in danger of IO queue underrun. However, the global dirty pages,
when pushed close to the limit, will eventually counteract our desire to
push up the low bdi_dirty.

In low memory JBOD tests we do see disks under-utilized from time to
time. The additional fix may be to add a BDI_async_underrun flag to
indicate that the block write queue is running low and it's time to
quickly fill the queue by unthrottling the tasks regardless of the
global limit.
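
As a rough illustration of the boost being added, here is a minimal
userspace C sketch of the same arithmetic as the hunk below (write_bw
stands for roughly one second's worth of writeout expressed in pages):

	/*
	 * When bdi_dirty falls below x_intercept = min(write_bw, freerun),
	 * scale pos_ratio up by x_intercept / bdi_dirty, capped at 8x, so
	 * tasks are progressively unthrottled as the dirty pool runs low.
	 */
	static unsigned long long boost_pos_ratio(unsigned long long pos_ratio,
						  unsigned long bdi_dirty,
						  unsigned long write_bw,
						  unsigned long freerun)
	{
		unsigned long x_intercept = write_bw < freerun ? write_bw : freerun;

		if (bdi_dirty < x_intercept) {
			if (bdi_dirty > x_intercept / 8)
				pos_ratio = pos_ratio * x_intercept / bdi_dirty;
			else
				pos_ratio *= 8;
		}
		return pos_ratio;
	}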

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-26 20:12:19.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 20:13:21.000000000 +0800
@@ -487,6 +487,16 @@ unsigned long bdi_dirty_limit(struct bac
  *   0 +------------.------------------.----------------------*------------->
  *           freerun^          setpoint^                 limit^   dirty pages
  *
+ * (o) bdi reserve area
+ *
+ * The bdi reserve area tries to keep a reasonable number of dirty pages for
+ * preventing block queue underrun.
+ *
+ * reserve area, scale up rate as dirty pages drop low
+ * |<----------------------------------------------->|
+ * |-------------------------------------------------------*-------|----------
+ * 0                                           bdi setpoint^       ^bdi_thresh
+ *
  * (o) bdi control lines
  *
  * The control lines for the global/bdi setpoints both stretch up to @limit.
@@ -634,6 +644,22 @@ static unsigned long bdi_position_ratio(
 	pos_ratio *= x_intercept - bdi_dirty;
 	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
 
+	/*
+	 * bdi reserve area, safeguard against dirty pool underrun and disk idle
+	 *
+	 * It may push the desired control point of global dirty pages higher
+	 * than setpoint. It's not necessary in single-bdi case because a
+	 * minimal pool of @freerun dirty pages will already be guaranteed.
+	 */
+	x_intercept = min(write_bw, freerun);
+	if (bdi_dirty < x_intercept) {
+		if (bdi_dirty > x_intercept / 8) {
+			pos_ratio *= x_intercept;
+			do_div(pos_ratio, bdi_dirty);
+		} else
+			pos_ratio *= 8;
+	}
+
 	return pos_ratio;
 }
 



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 10/18] writeback: dirty position control - bdi reserve area
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: bdi-reserve-area --]
[-- Type: text/plain, Size: 2876 bytes --]

Keep a minimal pool of dirty pages for each bdi, so that the disk IO
queues won't underrun.

It's particularly useful for JBOD and small-memory systems.

Note that this is not enough when memory is really tight (in comparison
to the write bandwidth). It may result in (pos_ratio > 1) at the setpoint
and push the dirty pages high. This is more or less intended because the
bdi is in danger of IO queue underrun. However, the global dirty pages,
when pushed close to the limit, will eventually counteract our desire to
push up the low bdi_dirty.

In low memory JBOD tests we do see disks under-utilized from time to
time. The additional fix may be to add a BDI_async_underrun flag to
indicate that the block write queue is running low and it's time to
quickly fill the queue by unthrottling the tasks regardless of the
global limit.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-26 20:12:19.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 20:13:21.000000000 +0800
@@ -487,6 +487,16 @@ unsigned long bdi_dirty_limit(struct bac
  *   0 +------------.------------------.----------------------*------------->
  *           freerun^          setpoint^                 limit^   dirty pages
  *
+ * (o) bdi reserve area
+ *
+ * The bdi reserve area tries to keep a reasonable number of dirty pages for
+ * preventing block queue underrun.
+ *
+ * reserve area, scale up rate as dirty pages drop low
+ * |<----------------------------------------------->|
+ * |-------------------------------------------------------*-------|----------
+ * 0                                           bdi setpoint^       ^bdi_thresh
+ *
  * (o) bdi control lines
  *
  * The control lines for the global/bdi setpoints both stretch up to @limit.
@@ -634,6 +644,22 @@ static unsigned long bdi_position_ratio(
 	pos_ratio *= x_intercept - bdi_dirty;
 	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
 
+	/*
+	 * bdi reserve area, safeguard against dirty pool underrun and disk idle
+	 *
+	 * It may push the desired control point of global dirty pages higher
+	 * than setpoint. It's not necessary in single-bdi case because a
+	 * minimal pool of @freerun dirty pages will already be guaranteed.
+	 */
+	x_intercept = min(write_bw, freerun);
+	if (bdi_dirty < x_intercept) {
+		if (bdi_dirty > x_intercept / 8) {
+			pos_ratio *= x_intercept;
+			do_div(pos_ratio, bdi_dirty);
+		} else
+			pos_ratio *= 8;
+	}
+
 	return pos_ratio;
 }
 



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 11/18] block: add bdi flag to indicate risk of io queue underrun
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Li Shaohua, Wu Fengguang,
	Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm,
	LKML

[-- Attachment #1: blk-queue-underrun.patch --]
[-- Type: text/plain, Size: 4129 bytes --]

Hurry things up when there are fewer than 3 async requests in the block IO queue:

1) don't dirty throttle the current dirtier

2) wake up the flusher for background writeout (XXX: the flusher may then
   abort, not being aware of the underrun)

In a 1-dd write test with dirty_bytes=1MB, this increased XFS writeout
throughput from 5MB/s to 55MB/s and disk utilization from ~3% to ~85%.
ext4 achieves almost the same. btrfs, however, does not fare as well: it
normally does only 1MB/s, with sudden rushes to 10-60MB/s.
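
The set/clear conditions in the block layer hunks below form a simple
hysteresis; a minimal sketch of the two predicates (plain C with
hypothetical names, not the patch itself):

	#define BLK_UNDERRUN_REQUESTS	3

	/* __freed_request(): nothing queued beyond the IO already in flight */
	static int should_set_underrun(int nr_queued_sync, int nr_queued_other,
				       int in_flight_sync)
	{
		return nr_queued_sync <= in_flight_sync && nr_queued_other == 0;
	}

	/* get_request(): queue refilled comfortably above the in-flight level */
	static int should_clear_underrun(int nr_queued_sync, int in_flight_sync)
	{
		return nr_queued_sync >= in_flight_sync + BLK_UNDERRUN_REQUESTS;
	}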

CC: Tejun Heo <tj@kernel.org>
CC: Jens Axboe <axboe@kernel.dk>
CC: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 block/blk-core.c            |    7 +++++++
 include/linux/backing-dev.h |   18 ++++++++++++++++++
 include/linux/blkdev.h      |   12 ++++++++++++
 mm/page-writeback.c         |    3 +++
 4 files changed, 40 insertions(+)

--- linux-next.orig/block/blk-core.c	2011-08-31 10:27:11.000000000 +0800
+++ linux-next/block/blk-core.c	2011-08-31 14:41:38.000000000 +0800
@@ -637,6 +637,10 @@ static void __freed_request(struct reque
 {
 	struct request_list *rl = &q->rq;
 
+	if (rl->count[sync] <= q->in_flight[sync] &&
+	    rl->count[!sync] == 0)
+		blk_set_queue_underrun(q, sync);
+
 	if (rl->count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
@@ -738,6 +742,9 @@ static struct request *get_request(struc
 	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
 		goto out;
 
+	if (rl->count[is_sync] >= q->in_flight[is_sync] + BLK_UNDERRUN_REQUESTS)
+		blk_clear_queue_underrun(q, is_sync);
+
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
--- linux-next.orig/include/linux/blkdev.h	2011-08-31 10:27:11.000000000 +0800
+++ linux-next/include/linux/blkdev.h	2011-08-31 10:49:43.000000000 +0800
@@ -699,6 +699,18 @@ static inline void blk_set_queue_congest
 	set_bdi_congested(&q->backing_dev_info, sync);
 }
 
+#define BLK_UNDERRUN_REQUESTS	3
+
+static inline void blk_clear_queue_underrun(struct request_queue *q, int sync)
+{
+	clear_bdi_underrun(&q->backing_dev_info, sync);
+}
+
+static inline void blk_set_queue_underrun(struct request_queue *q, int sync)
+{
+	set_bdi_underrun(&q->backing_dev_info, sync);
+}
+
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
--- linux-next.orig/include/linux/backing-dev.h	2011-08-31 10:27:11.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-31 10:49:43.000000000 +0800
@@ -32,6 +32,7 @@ enum bdi_state {
 	BDI_sync_congested,	/* The sync queue is getting full */
 	BDI_registered,		/* bdi_register() was done */
 	BDI_writeback_running,	/* Writeback is in progress */
+	BDI_async_underrun,	/* The async queue is getting underrun */
 	BDI_unused,		/* Available bits start here */
 };
 
@@ -301,6 +302,23 @@ void set_bdi_congested(struct backing_de
 long congestion_wait(int sync, long timeout);
 long wait_iff_congested(struct zone *zone, int sync, long timeout);
 
+static inline void clear_bdi_underrun(struct backing_dev_info *bdi, int sync)
+{
+	if (sync == BLK_RW_ASYNC)
+		clear_bit(BDI_async_underrun, &bdi->state);
+}
+
+static inline void set_bdi_underrun(struct backing_dev_info *bdi, int sync)
+{
+	if (sync == BLK_RW_ASYNC)
+		set_bit(BDI_async_underrun, &bdi->state);
+}
+
+static inline int bdi_async_underrun(struct backing_dev_info *bdi)
+{
+	return bdi->state & (1 << BDI_async_underrun);
+}
+
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
 	return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
--- linux-next.orig/mm/page-writeback.c	2011-08-31 10:49:43.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-31 14:40:58.000000000 +0800
@@ -1067,6 +1067,9 @@ static void balance_dirty_pages(struct a
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
+		if (unlikely(!dirty_exceeded && bdi_async_underrun(bdi)))
+			break;
+
 		dirty_ratelimit = bdi->dirty_ratelimit;
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 11/18] block: add bdi flag to indicate risk of io queue underrun
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Li Shaohua, Wu Fengguang,
	Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm,
	LKML

[-- Attachment #1: blk-queue-underrun.patch --]
[-- Type: text/plain, Size: 4432 bytes --]

Hurry things up when there are fewer than 3 async requests in the block IO queue:

1) don't dirty throttle the current dirtier

2) wake up the flusher for background writeout (XXX: the flusher may then
   abort, not being aware of the underrun)

In a 1-dd write test with dirty_bytes=1MB, this increased XFS writeout
throughput from 5MB/s to 55MB/s and disk utilization from ~3% to ~85%.
ext4 achieves almost the same. btrfs, however, does not fare as well: it
normally does only 1MB/s, with sudden rushes to 10-60MB/s.

CC: Tejun Heo <tj@kernel.org>
CC: Jens Axboe <axboe@kernel.dk>
CC: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 block/blk-core.c            |    7 +++++++
 include/linux/backing-dev.h |   18 ++++++++++++++++++
 include/linux/blkdev.h      |   12 ++++++++++++
 mm/page-writeback.c         |    3 +++
 4 files changed, 40 insertions(+)

--- linux-next.orig/block/blk-core.c	2011-08-31 10:27:11.000000000 +0800
+++ linux-next/block/blk-core.c	2011-08-31 14:41:38.000000000 +0800
@@ -637,6 +637,10 @@ static void __freed_request(struct reque
 {
 	struct request_list *rl = &q->rq;
 
+	if (rl->count[sync] <= q->in_flight[sync] &&
+	    rl->count[!sync] == 0)
+		blk_set_queue_underrun(q, sync);
+
 	if (rl->count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
@@ -738,6 +742,9 @@ static struct request *get_request(struc
 	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
 		goto out;
 
+	if (rl->count[is_sync] >= q->in_flight[is_sync] + BLK_UNDERRUN_REQUESTS)
+		blk_clear_queue_underrun(q, is_sync);
+
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
--- linux-next.orig/include/linux/blkdev.h	2011-08-31 10:27:11.000000000 +0800
+++ linux-next/include/linux/blkdev.h	2011-08-31 10:49:43.000000000 +0800
@@ -699,6 +699,18 @@ static inline void blk_set_queue_congest
 	set_bdi_congested(&q->backing_dev_info, sync);
 }
 
+#define BLK_UNDERRUN_REQUESTS	3
+
+static inline void blk_clear_queue_underrun(struct request_queue *q, int sync)
+{
+	clear_bdi_underrun(&q->backing_dev_info, sync);
+}
+
+static inline void blk_set_queue_underrun(struct request_queue *q, int sync)
+{
+	set_bdi_underrun(&q->backing_dev_info, sync);
+}
+
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
--- linux-next.orig/include/linux/backing-dev.h	2011-08-31 10:27:11.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-31 10:49:43.000000000 +0800
@@ -32,6 +32,7 @@ enum bdi_state {
 	BDI_sync_congested,	/* The sync queue is getting full */
 	BDI_registered,		/* bdi_register() was done */
 	BDI_writeback_running,	/* Writeback is in progress */
+	BDI_async_underrun,	/* The async queue is getting underrun */
 	BDI_unused,		/* Available bits start here */
 };
 
@@ -301,6 +302,23 @@ void set_bdi_congested(struct backing_de
 long congestion_wait(int sync, long timeout);
 long wait_iff_congested(struct zone *zone, int sync, long timeout);
 
+static inline void clear_bdi_underrun(struct backing_dev_info *bdi, int sync)
+{
+	if (sync == BLK_RW_ASYNC)
+		clear_bit(BDI_async_underrun, &bdi->state);
+}
+
+static inline void set_bdi_underrun(struct backing_dev_info *bdi, int sync)
+{
+	if (sync == BLK_RW_ASYNC)
+		set_bit(BDI_async_underrun, &bdi->state);
+}
+
+static inline int bdi_async_underrun(struct backing_dev_info *bdi)
+{
+	return bdi->state & (1 << BDI_async_underrun);
+}
+
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
 	return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
--- linux-next.orig/mm/page-writeback.c	2011-08-31 10:49:43.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-31 14:40:58.000000000 +0800
@@ -1067,6 +1067,9 @@ static void balance_dirty_pages(struct a
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
+		if (unlikely(!dirty_exceeded && bdi_async_underrun(bdi)))
+			break;
+
 		dirty_ratelimit = bdi->dirty_ratelimit;
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 11/18] block: add bdi flag to indicate risk of io queue underrun
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Li Shaohua, Wu Fengguang,
	Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm,
	LKML

[-- Attachment #1: blk-queue-underrun.patch --]
[-- Type: text/plain, Size: 4432 bytes --]

Hurry things up when there are fewer than 3 async requests in the block IO queue:

1) don't dirty throttle the current dirtier

2) wake up the flusher for background writeout (XXX: the flusher may then
   abort, not being aware of the underrun)

In a 1-dd write test with dirty_bytes=1MB, this increased XFS writeout
throughput from 5MB/s to 55MB/s and disk utilization from ~3% to ~85%.
ext4 achieves almost the same. btrfs, however, does not fare as well: it
normally does only 1MB/s, with sudden rushes to 10-60MB/s.

CC: Tejun Heo <tj@kernel.org>
CC: Jens Axboe <axboe@kernel.dk>
CC: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 block/blk-core.c            |    7 +++++++
 include/linux/backing-dev.h |   18 ++++++++++++++++++
 include/linux/blkdev.h      |   12 ++++++++++++
 mm/page-writeback.c         |    3 +++
 4 files changed, 40 insertions(+)

--- linux-next.orig/block/blk-core.c	2011-08-31 10:27:11.000000000 +0800
+++ linux-next/block/blk-core.c	2011-08-31 14:41:38.000000000 +0800
@@ -637,6 +637,10 @@ static void __freed_request(struct reque
 {
 	struct request_list *rl = &q->rq;
 
+	if (rl->count[sync] <= q->in_flight[sync] &&
+	    rl->count[!sync] == 0)
+		blk_set_queue_underrun(q, sync);
+
 	if (rl->count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
@@ -738,6 +742,9 @@ static struct request *get_request(struc
 	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
 		goto out;
 
+	if (rl->count[is_sync] >= q->in_flight[is_sync] + BLK_UNDERRUN_REQUESTS)
+		blk_clear_queue_underrun(q, is_sync);
+
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
--- linux-next.orig/include/linux/blkdev.h	2011-08-31 10:27:11.000000000 +0800
+++ linux-next/include/linux/blkdev.h	2011-08-31 10:49:43.000000000 +0800
@@ -699,6 +699,18 @@ static inline void blk_set_queue_congest
 	set_bdi_congested(&q->backing_dev_info, sync);
 }
 
+#define BLK_UNDERRUN_REQUESTS	3
+
+static inline void blk_clear_queue_underrun(struct request_queue *q, int sync)
+{
+	clear_bdi_underrun(&q->backing_dev_info, sync);
+}
+
+static inline void blk_set_queue_underrun(struct request_queue *q, int sync)
+{
+	set_bdi_underrun(&q->backing_dev_info, sync);
+}
+
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
--- linux-next.orig/include/linux/backing-dev.h	2011-08-31 10:27:11.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-31 10:49:43.000000000 +0800
@@ -32,6 +32,7 @@ enum bdi_state {
 	BDI_sync_congested,	/* The sync queue is getting full */
 	BDI_registered,		/* bdi_register() was done */
 	BDI_writeback_running,	/* Writeback is in progress */
+	BDI_async_underrun,	/* The async queue is getting underrun */
 	BDI_unused,		/* Available bits start here */
 };
 
@@ -301,6 +302,23 @@ void set_bdi_congested(struct backing_de
 long congestion_wait(int sync, long timeout);
 long wait_iff_congested(struct zone *zone, int sync, long timeout);
 
+static inline void clear_bdi_underrun(struct backing_dev_info *bdi, int sync)
+{
+	if (sync == BLK_RW_ASYNC)
+		clear_bit(BDI_async_underrun, &bdi->state);
+}
+
+static inline void set_bdi_underrun(struct backing_dev_info *bdi, int sync)
+{
+	if (sync == BLK_RW_ASYNC)
+		set_bit(BDI_async_underrun, &bdi->state);
+}
+
+static inline int bdi_async_underrun(struct backing_dev_info *bdi)
+{
+	return bdi->state & (1 << BDI_async_underrun);
+}
+
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
 	return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
--- linux-next.orig/mm/page-writeback.c	2011-08-31 10:49:43.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-31 14:40:58.000000000 +0800
@@ -1067,6 +1067,9 @@ static void balance_dirty_pages(struct a
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
+		if (unlikely(!dirty_exceeded && bdi_async_underrun(bdi)))
+			break;
+
 		dirty_ratelimit = bdi->dirty_ratelimit;
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 12/18] writeback: balanced_rate cannot exceed write bandwidth
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: ref-bw-up-bound --]
[-- Type: text/plain, Size: 970 bytes --]

Add an upper limit to balanced_rate according to the inequality below.
This filters out some rare but huge singular points, which at least
enables more readable gnuplot figures.

When there are N dd dirtiers,

	balanced_dirty_ratelimit = write_bw / N

So it holds that

	balanced_dirty_ratelimit <= write_bw
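
A minimal userspace sketch of the resulting computation (mirroring the
hunk below; the numbers in the comment are hypothetical):

	/*
	 * E.g. with write_bw = 100MB/s and N = 4 dd tasks, balanced_rate
	 * settles around 25MB/s; a transient dip in the measured dirty_rate
	 * can make the division blow up, and the cap bounds such singular
	 * points at write_bw.
	 */
	static unsigned long long balanced_rate(unsigned long long task_ratelimit,
						unsigned long long write_bw,
						unsigned long long dirty_rate)
	{
		unsigned long long rate = task_ratelimit * write_bw / (dirty_rate | 1);

		return rate > write_bw ? write_bw : rate;
	}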

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    5 +++++
 1 file changed, 5 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:14:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:20:36.000000000 +0800
@@ -828,6 +828,11 @@ static void bdi_update_dirty_ratelimit(s
 	 */
 	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
 					   dirty_rate | 1);
+	/*
+	 * balanced_dirty_ratelimit ~= (write_bw / N) <= write_bw
+	 */
+	if (unlikely(balanced_dirty_ratelimit > write_bw))
+		balanced_dirty_ratelimit = write_bw;
 
 	/*
 	 * We could safely do this and return immediately:



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 12/18] writeback: balanced_rate cannot exceed write bandwidth
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: ref-bw-up-bound --]
[-- Type: text/plain, Size: 1273 bytes --]

Add an upper limit to balanced_rate according to the inequality below.
This filters out some rare but huge singular points, which at least
enables more readable gnuplot figures.

When there are N dd dirtiers,

	balanced_dirty_ratelimit = write_bw / N

So it holds that

	balanced_dirty_ratelimit <= write_bw

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    5 +++++
 1 file changed, 5 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:14:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:20:36.000000000 +0800
@@ -828,6 +828,11 @@ static void bdi_update_dirty_ratelimit(s
 	 */
 	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
 					   dirty_rate | 1);
+	/*
+	 * balanced_dirty_ratelimit ~= (write_bw / N) <= write_bw
+	 */
+	if (unlikely(balanced_dirty_ratelimit > write_bw))
+		balanced_dirty_ratelimit = write_bw;
 
 	/*
 	 * We could safely do this and return immediately:



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 12/18] writeback: balanced_rate cannot exceed write bandwidth
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: ref-bw-up-bound --]
[-- Type: text/plain, Size: 1273 bytes --]

Add an upper limit to balanced_rate according to the inequality below.
This filters out some rare but huge singular points, which at least
enables more readable gnuplot figures.

When there are N dd dirtiers,

	balanced_dirty_ratelimit = write_bw / N

So it holds that

	balanced_dirty_ratelimit <= write_bw

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    5 +++++
 1 file changed, 5 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:14:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:20:36.000000000 +0800
@@ -828,6 +828,11 @@ static void bdi_update_dirty_ratelimit(s
 	 */
 	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
 					   dirty_rate | 1);
+	/*
+	 * balanced_dirty_ratelimit ~= (write_bw / N) <= write_bw
+	 */
+	if (unlikely(balanced_dirty_ratelimit > write_bw))
+		balanced_dirty_ratelimit = write_bw;
 
 	/*
 	 * We could safely do this and return immediately:



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 13/18] writeback: limit max dirty pause time
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: max-pause --]
[-- Type: text/plain, Size: 3172 bytes --]

Apply two policies to scale down the max pause time for

1) a small number of concurrent dirtiers
2) small-memory systems (compared to storage bandwidth)

MAX_PAUSE=200ms may only be suitable for high-end servers with lots of
concurrent dirtiers, where the large pause time can considerably reduce overhead.

Otherwise, a smaller pause time is desirable whenever possible, so as to
get good responsiveness and a smooth user experience. It's actually
required for good disk utilization in the case where all the dirty pages
can be synced to disk within MAX_PAUSE=200ms.
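
As a sanity check of the small-memory cap used below (a derivation, not
text from the patch): the cap

	t = bdi_dirty >> (30 - PAGE_CACHE_SHIFT - ilog2(HZ))
	  ~= bdi_dirty * PAGE_SIZE * HZ / 2^30		[jiffies]

works out to t/HZ ~= (bdi dirty bytes) / 1GB seconds, i.e. roughly 1ms of
max pause per 1MB of bdi dirty pages, independent of HZ. With only 8MB of
dirty pages on a bdi, for example, the pause is capped at about 8ms so the
small pool cannot drain while the task sleeps.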

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   45 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-09-01 09:43:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-01 09:43:39.000000000 +0800
@@ -976,6 +976,42 @@ static unsigned long dirty_poll_interval
 	return 1;
 }
 
+static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
+				   unsigned long bdi_dirty)
+{
+	unsigned long hi = ilog2(bdi->write_bandwidth);
+	unsigned long lo = ilog2(bdi->dirty_ratelimit);
+	unsigned long t;
+
+	/* target for ~10ms pause on 1-dd case */
+	t = HZ / 50;
+
+	/*
+	 * Scale up pause time for concurrent dirtiers in order to reduce CPU
+	 * overheads.
+	 *
+	 * (N * 20ms) on 2^N concurrent tasks.
+	 */
+	if (hi > lo)
+		t += (hi - lo) * (20 * HZ) / 1024;
+
+	/*
+	 * Limit pause time for small memory systems. If sleeping for too long
+	 * time, a small pool of dirty/writeback pages may go empty and disk go
+	 * idle.
+	 *
+	 * 1ms for every 1MB; may further consider bdi bandwidth.
+	 */
+	if (bdi_dirty)
+		t = min(t, bdi_dirty >> (30 - PAGE_CACHE_SHIFT - ilog2(HZ)));
+
+	/*
+	 * The pause time will be settled within range (max_pause/4, max_pause).
+	 * Apply a minimal value of 4 to get a non-zero max_pause/4.
+	 */
+	return clamp_val(t, 4, MAX_PAUSE);
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -995,6 +1031,7 @@ static void balance_dirty_pages(struct a
 	unsigned long bdi_thresh;
 	long period;
 	long pause = 0;
+	long max_pause;
 	bool dirty_exceeded = false;
 	unsigned long task_ratelimit;
 	unsigned long dirty_ratelimit;
@@ -1079,13 +1116,15 @@ static void balance_dirty_pages(struct a
 		if (unlikely(!dirty_exceeded && bdi_async_underrun(bdi)))
 			break;
 
+		max_pause = bdi_max_pause(bdi, bdi_dirty);
+
 		dirty_ratelimit = bdi->dirty_ratelimit;
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,
 					       bdi_thresh, bdi_dirty);
 		if (unlikely(pos_ratio == 0)) {
-			period = MAX_PAUSE;
-			pause = MAX_PAUSE;
+			period = max_pause;
+			pause = max_pause;
 			goto pause;
 		}
 		task_ratelimit = (u64)dirty_ratelimit *
@@ -1122,7 +1161,7 @@ static void balance_dirty_pages(struct a
 			pause = 1; /* avoid resetting nr_dirtied_pause below */
 			break;
 		}
-		pause = min_t(long, pause, MAX_PAUSE);
+		pause = min(pause, max_pause);
 
 pause:
 		trace_balance_dirty_pages(bdi,



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 13/18] writeback: limit max dirty pause time
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: max-pause --]
[-- Type: text/plain, Size: 3475 bytes --]

Apply two policies to scale down the max pause time for

1) a small number of concurrent dirtiers
2) small-memory systems (compared to storage bandwidth)

MAX_PAUSE=200ms may only be suitable for high-end servers with lots of
concurrent dirtiers, where the large pause time can considerably reduce overhead.

Otherwise, a smaller pause time is desirable whenever possible, so as to
get good responsiveness and a smooth user experience. It's actually
required for good disk utilization in the case where all the dirty pages
can be synced to disk within MAX_PAUSE=200ms.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   45 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-09-01 09:43:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-01 09:43:39.000000000 +0800
@@ -976,6 +976,42 @@ static unsigned long dirty_poll_interval
 	return 1;
 }
 
+static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
+				   unsigned long bdi_dirty)
+{
+	unsigned long hi = ilog2(bdi->write_bandwidth);
+	unsigned long lo = ilog2(bdi->dirty_ratelimit);
+	unsigned long t;
+
+	/* target for ~10ms pause on 1-dd case */
+	t = HZ / 50;
+
+	/*
+	 * Scale up pause time for concurrent dirtiers in order to reduce CPU
+	 * overheads.
+	 *
+	 * (N * 20ms) on 2^N concurrent tasks.
+	 */
+	if (hi > lo)
+		t += (hi - lo) * (20 * HZ) / 1024;
+
+	/*
+	 * Limit pause time for small memory systems. If sleeping for too long
+	 * time, a small pool of dirty/writeback pages may go empty and disk go
+	 * idle.
+	 *
+	 * 1ms for every 1MB; may further consider bdi bandwidth.
+	 */
+	if (bdi_dirty)
+		t = min(t, bdi_dirty >> (30 - PAGE_CACHE_SHIFT - ilog2(HZ)));
+
+	/*
+	 * The pause time will be settled within range (max_pause/4, max_pause).
+	 * Apply a minimal value of 4 to get a non-zero max_pause/4.
+	 */
+	return clamp_val(t, 4, MAX_PAUSE);
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -995,6 +1031,7 @@ static void balance_dirty_pages(struct a
 	unsigned long bdi_thresh;
 	long period;
 	long pause = 0;
+	long max_pause;
 	bool dirty_exceeded = false;
 	unsigned long task_ratelimit;
 	unsigned long dirty_ratelimit;
@@ -1079,13 +1116,15 @@ static void balance_dirty_pages(struct a
 		if (unlikely(!dirty_exceeded && bdi_async_underrun(bdi)))
 			break;
 
+		max_pause = bdi_max_pause(bdi, bdi_dirty);
+
 		dirty_ratelimit = bdi->dirty_ratelimit;
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,
 					       bdi_thresh, bdi_dirty);
 		if (unlikely(pos_ratio == 0)) {
-			period = MAX_PAUSE;
-			pause = MAX_PAUSE;
+			period = max_pause;
+			pause = max_pause;
 			goto pause;
 		}
 		task_ratelimit = (u64)dirty_ratelimit *
@@ -1122,7 +1161,7 @@ static void balance_dirty_pages(struct a
 			pause = 1; /* avoid resetting nr_dirtied_pause below */
 			break;
 		}
-		pause = min_t(long, pause, MAX_PAUSE);
+		pause = min(pause, max_pause);
 
 pause:
 		trace_balance_dirty_pages(bdi,



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 13/18] writeback: limit max dirty pause time
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: max-pause --]
[-- Type: text/plain, Size: 3475 bytes --]

Apply two policies to scale down the max pause time for

1) a small number of concurrent dirtiers
2) small-memory systems (compared to storage bandwidth)

MAX_PAUSE=200ms may only be suitable for high-end servers with lots of
concurrent dirtiers, where the large pause time can considerably reduce overhead.

Otherwise, a smaller pause time is desirable whenever possible, so as to
get good responsiveness and a smooth user experience. It's actually
required for good disk utilization in the case where all the dirty pages
can be synced to disk within MAX_PAUSE=200ms.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   45 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-09-01 09:43:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-01 09:43:39.000000000 +0800
@@ -976,6 +976,42 @@ static unsigned long dirty_poll_interval
 	return 1;
 }
 
+static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
+				   unsigned long bdi_dirty)
+{
+	unsigned long hi = ilog2(bdi->write_bandwidth);
+	unsigned long lo = ilog2(bdi->dirty_ratelimit);
+	unsigned long t;
+
+	/* target for ~10ms pause on 1-dd case */
+	t = HZ / 50;
+
+	/*
+	 * Scale up pause time for concurrent dirtiers in order to reduce CPU
+	 * overheads.
+	 *
+	 * (N * 20ms) on 2^N concurrent tasks.
+	 */
+	if (hi > lo)
+		t += (hi - lo) * (20 * HZ) / 1024;
+
+	/*
+	 * Limit pause time for small memory systems. If sleeping for too long
+	 * time, a small pool of dirty/writeback pages may go empty and disk go
+	 * idle.
+	 *
+	 * 1ms for every 1MB; may further consider bdi bandwidth.
+	 */
+	if (bdi_dirty)
+		t = min(t, bdi_dirty >> (30 - PAGE_CACHE_SHIFT - ilog2(HZ)));
+
+	/*
+	 * The pause time will be settled within range (max_pause/4, max_pause).
+	 * Apply a minimal value of 4 to get a non-zero max_pause/4.
+	 */
+	return clamp_val(t, 4, MAX_PAUSE);
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -995,6 +1031,7 @@ static void balance_dirty_pages(struct a
 	unsigned long bdi_thresh;
 	long period;
 	long pause = 0;
+	long max_pause;
 	bool dirty_exceeded = false;
 	unsigned long task_ratelimit;
 	unsigned long dirty_ratelimit;
@@ -1079,13 +1116,15 @@ static void balance_dirty_pages(struct a
 		if (unlikely(!dirty_exceeded && bdi_async_underrun(bdi)))
 			break;
 
+		max_pause = bdi_max_pause(bdi, bdi_dirty);
+
 		dirty_ratelimit = bdi->dirty_ratelimit;
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,
 					       bdi_thresh, bdi_dirty);
 		if (unlikely(pos_ratio == 0)) {
-			period = MAX_PAUSE;
-			pause = MAX_PAUSE;
+			period = max_pause;
+			pause = max_pause;
 			goto pause;
 		}
 		task_ratelimit = (u64)dirty_ratelimit *
@@ -1122,7 +1161,7 @@ static void balance_dirty_pages(struct a
 			pause = 1; /* avoid resetting nr_dirtied_pause below */
 			break;
 		}
-		pause = min_t(long, pause, MAX_PAUSE);
+		pause = min(pause, max_pause);
 
 pause:
 		trace_balance_dirty_pages(bdi,



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 14/18] writeback: control dirty pause time
  2011-09-04  1:53 ` Wu Fengguang
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: max-pause-adaption --]
[-- Type: text/plain, Size: 1871 bytes --]

The dirty pause time shall ultimately be controlled by adjusting
nr_dirtied_pause, since there is the relationship

	pause = pages_dirtied / task_ratelimit

Assuming

	pages_dirtied ~= nr_dirtied_pause
	task_ratelimit ~= dirty_ratelimit

We get

	nr_dirtied_pause ~= dirty_ratelimit * desired_pause

Here dirty_ratelimit is preferred over task_ratelimit because it's
more stable.

It's also important to limit possible large transitional errors:

- bw is changing quickly
- pages_dirtied << nr_dirtied_pause on entering dirty exceeded area
- pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a
  separate fix, but still expect non-trivial errors)

So we end up using the above formula inside clamp_val().

The best test case for this code is to run 100 "dd bs=4M" tasks on
btrfs and check its pause time distribution.
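
As a worked example of the basic relation (hypothetical numbers, 4KB
pages): with dirty_ratelimit = 10MB/s = 2560 pages/s and a desired pause
of 10ms, nr_dirtied_pause ~= 2560 * 0.01 ~= 26 pages, so a task dirties
roughly 100KB between sleeps; if the bandwidth later halves, the target
drops to ~13 pages and the clamp keeps each adjustment step within
bounds.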

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:08:43.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:08:44.000000000 +0800
@@ -1193,7 +1193,20 @@ pause:
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
-	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
+	if (pause == 0)
+		current->nr_dirtied_pause =
+				dirty_poll_interval(nr_dirty, dirty_thresh);
+	else if (period <= max_pause / 4 &&
+		 pages_dirtied >= current->nr_dirtied_pause)
+		current->nr_dirtied_pause = clamp_val(
+					dirty_ratelimit * (max_pause / 2) / HZ,
+					pages_dirtied + pages_dirtied / 8,
+					pages_dirtied * 4);
+	else if (pause >= max_pause)
+		current->nr_dirtied_pause = 1 | clamp_val(
+					dirty_ratelimit * (max_pause * 3/8)/HZ,
+					pages_dirtied / 4,
+					pages_dirtied * 7/8);
 
 	if (writeback_in_progress(bdi))
 		return;



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 14/18] writeback: control dirty pause time
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: max-pause-adaption --]
[-- Type: text/plain, Size: 2174 bytes --]

The dirty pause time shall ultimately be controlled by adjusting
nr_dirtied_pause, since there is the relationship

	pause = pages_dirtied / task_ratelimit

Assuming

	pages_dirtied ~= nr_dirtied_pause
	task_ratelimit ~= dirty_ratelimit

We get

	nr_dirtied_pause ~= dirty_ratelimit * desired_pause

Here dirty_ratelimit is preferred over task_ratelimit because it's
more stable.

It's also important to limit possible large transitional errors:

- bw is changing quickly
- pages_dirtied << nr_dirtied_pause on entering dirty exceeded area
- pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a
  separate fix, but still expect non-trivial errors)

So we end up using the above formula inside clamp_val().

The best test case for this code is to run 100 "dd bs=4M" tasks on
btrfs and check its pause time distribution.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:08:43.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:08:44.000000000 +0800
@@ -1193,7 +1193,20 @@ pause:
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
-	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
+	if (pause == 0)
+		current->nr_dirtied_pause =
+				dirty_poll_interval(nr_dirty, dirty_thresh);
+	else if (period <= max_pause / 4 &&
+		 pages_dirtied >= current->nr_dirtied_pause)
+		current->nr_dirtied_pause = clamp_val(
+					dirty_ratelimit * (max_pause / 2) / HZ,
+					pages_dirtied + pages_dirtied / 8,
+					pages_dirtied * 4);
+	else if (pause >= max_pause)
+		current->nr_dirtied_pause = 1 | clamp_val(
+					dirty_ratelimit * (max_pause * 3/8)/HZ,
+					pages_dirtied / 4,
+					pages_dirtied * 7/8);
 
 	if (writeback_in_progress(bdi))
 		return;



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 15/18] writeback: charge leaked page dirties to active tasks
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-save-leaks-at-exit.patch --]
[-- Type: text/plain, Size: 2403 bytes --]

It's a years-long problem that a large number of short-lived dirtiers
(eg. gcc instances in a fast kernel build) may starve long-running
dirtiers (eg. dd) as well as push the dirty pages up to the global hard
limit.

The solution is to charge the pages dirtied by an exited gcc to other,
randomly chosen gcc/dd instances. It's not perfect, but should behave
well enough in practice.
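
A minimal userspace sketch of the handoff (the per-CPU counter and the
task fields are modeled as plain variables here; an illustration, not the
patch):

	static int dirty_leaks;		/* per-CPU in the real patch */

	/* do_exit(): park the un-accounted dirties of the exiting task */
	static void leak_dirties_on_exit(int *tsk_nr_dirtied)
	{
		dirty_leaks += *tsk_nr_dirtied;
		*tsk_nr_dirtied = 0;
	}

	/* balance_dirty_pages_ratelimited(): a later dirtier picks them up */
	static void pick_up_leaked_dirties(int *cur_nr_dirtied, int ratelimit)
	{
		if (dirty_leaks > 0 && *cur_nr_dirtied < ratelimit) {
			int n = ratelimit - *cur_nr_dirtied;

			if (n > dirty_leaks)
				n = dirty_leaks;
			dirty_leaks -= n;
			*cur_nr_dirtied += n;	/* charged to the active task */
		}
	}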

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    2 ++
 kernel/exit.c             |    2 ++
 mm/page-writeback.c       |   12 ++++++++++++
 3 files changed, 16 insertions(+)

--- linux-next.orig/include/linux/writeback.h	2011-08-29 19:14:22.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-29 19:14:32.000000000 +0800
@@ -7,6 +7,8 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 
+DECLARE_PER_CPU(int, dirty_leaks);
+
 /*
  * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
  *
--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:14:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:14:32.000000000 +0800
@@ -1237,6 +1237,7 @@ void set_page_dirty_balance(struct page 
 }
 
 static DEFINE_PER_CPU(int, bdp_ratelimits);
+DEFINE_PER_CPU(int, dirty_leaks) = 0;
 
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
@@ -1285,6 +1286,17 @@ void balance_dirty_pages_ratelimited_nr(
 			ratelimit = 0;
 		}
 	}
+	/*
+	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
+	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
+	 * the dirty throttling and livelock other long-run dirtiers.
+	 */
+	p = &__get_cpu_var(dirty_leaks);
+	if (*p > 0 && current->nr_dirtied < ratelimit) {
+		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
+		*p -= nr_pages_dirtied;
+		current->nr_dirtied += nr_pages_dirtied;
+	}
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
--- linux-next.orig/kernel/exit.c	2011-08-26 16:19:27.000000000 +0800
+++ linux-next/kernel/exit.c	2011-08-29 19:14:22.000000000 +0800
@@ -1044,6 +1044,8 @@ NORET_TYPE void do_exit(long code)
 	validate_creds_for_do_exit(tsk);
 
 	preempt_disable();
+	if (tsk->nr_dirtied)
+		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
 	exit_rcu();
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 15/18] writeback: charge leaked page dirties to active tasks
@ 2011-09-04  1:53   ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-save-leaks-at-exit.patch --]
[-- Type: text/plain, Size: 2706 bytes --]

It's a years-long problem that a large number of short-lived dirtiers
(eg. gcc instances in a fast kernel build) may starve long-running
dirtiers (eg. dd) as well as push the dirty pages up to the global hard
limit.

The solution is to charge the pages dirtied by an exited gcc to other,
randomly chosen gcc/dd instances. It's not perfect, but should behave
well enough in practice.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    2 ++
 kernel/exit.c             |    2 ++
 mm/page-writeback.c       |   12 ++++++++++++
 3 files changed, 16 insertions(+)

--- linux-next.orig/include/linux/writeback.h	2011-08-29 19:14:22.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-29 19:14:32.000000000 +0800
@@ -7,6 +7,8 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 
+DECLARE_PER_CPU(int, dirty_leaks);
+
 /*
  * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
  *
--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:14:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:14:32.000000000 +0800
@@ -1237,6 +1237,7 @@ void set_page_dirty_balance(struct page 
 }
 
 static DEFINE_PER_CPU(int, bdp_ratelimits);
+DEFINE_PER_CPU(int, dirty_leaks) = 0;
 
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
@@ -1285,6 +1286,17 @@ void balance_dirty_pages_ratelimited_nr(
 			ratelimit = 0;
 		}
 	}
+	/*
+	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
+	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
+	 * the dirty throttling and livelock other long-run dirtiers.
+	 */
+	p = &__get_cpu_var(dirty_leaks);
+	if (*p > 0 && current->nr_dirtied < ratelimit) {
+		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
+		*p -= nr_pages_dirtied;
+		current->nr_dirtied += nr_pages_dirtied;
+	}
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
--- linux-next.orig/kernel/exit.c	2011-08-26 16:19:27.000000000 +0800
+++ linux-next/kernel/exit.c	2011-08-29 19:14:22.000000000 +0800
@@ -1044,6 +1044,8 @@ NORET_TYPE void do_exit(long code)
 	validate_creds_for_do_exit(tsk);
 
 	preempt_disable();
+	if (tsk->nr_dirtied)
+		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
 	exit_rcu();
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 16/18] writeback: fix dirtied pages accounting on sub-page writes
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-accurate-task-dirtied.patch --]
[-- Type: text/plain, Size: 1053 bytes --]

When dd writes in 512-byte chunks, generic_perform_write() calls
balance_dirty_pages_ratelimited() 8 times for the same (4KB) page, but
obviously the page is only dirtied once.

Fix it by accounting nr_dirtied at page dirty time.
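
For illustration only (a userspace toy assuming the usual 4096-byte page
size; none of this is in the patch), the loop below mimics what
generic_perform_write() does for 512-byte copies and counts how often
the throttle hook fires versus how often the page actually turns dirty:

#include <stdio.h>

int main(void)
{
        const int page_size = 4096, write_size = 512;
        int ratelimit_calls = 0, pages_dirtied = 0;
        int off;

        for (off = 0; off < page_size; off += write_size) {
                if (off == 0)
                        pages_dirtied++;        /* the page turns dirty once */
                ratelimit_calls++;              /* but the hook runs per copy */
        }
        printf("%d ratelimit calls vs %d page dirtied\n",
               ratelimit_calls, pages_dirtied); /* 8 vs 1 */
        return 0;
}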

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:14:32.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:14:36.000000000 +0800
@@ -1267,8 +1267,6 @@ void balance_dirty_pages_ratelimited_nr(
 	if (bdi->dirty_exceeded)
 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
 
-	current->nr_dirtied += nr_pages_dirtied;
-
 	preempt_disable();
 	/*
 	 * This prevents one CPU to accumulate too many dirtied pages without
@@ -1778,6 +1776,7 @@ void account_page_dirtied(struct page *p
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
+		current->nr_dirtied++;
 	}
 }
 EXPORT_SYMBOL(account_page_dirtied);



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 17/18] writeback: fix dirtied pages accounting on redirty
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-account-redirty --]
[-- Type: text/plain, Size: 2080 bytes --]

De-account the accumulated dirty counters on page redirty.

Page redirties (very common in ext4) will introduce a mismatch between
counters (a) and (b):

a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
b) NR_WRITTEN, BDI_WRITTEN

This will introduce systematic errors into balanced_rate and result in
dirty page position errors (ie. the dirty pages are no longer balanced
around the global/bdi setpoints).
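
To make the systematic error concrete, a back-of-the-envelope
calculation with made-up numbers (not measurements from this patch): if
a quarter of the pages are redirtied once, the dirtied counters grow 25%
faster than the written counters, biasing the estimated dirty rate by
the same amount.

#include <stdio.h>

int main(void)
{
        double written_per_sec = 1000.0;        /* NR_WRITTEN growth */
        double dirtied_per_sec = 1000.0;        /* genuine page dirties */
        double redirty_per_sec =  250.0;        /* ext4-style redirties */

        /* without de-accounting, redirties inflate the dirtied counters */
        double naive = (dirtied_per_sec + redirty_per_sec) / written_per_sec;
        double fixed = dirtied_per_sec / written_per_sec;

        printf("dirtied:written = %.2f naive vs %.2f de-accounted\n",
               naive, fixed);                   /* 1.25 vs 1.00 */
        return 0;
}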

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    2 ++
 mm/page-writeback.c       |   12 ++++++++++++
 2 files changed, 14 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-29 19:14:36.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-29 19:14:38.000000000 +0800
@@ -1836,6 +1836,17 @@ int __set_page_dirty_nobuffers(struct pa
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
+void account_page_redirty(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	if (mapping && mapping_cap_account_dirty(mapping)) {
+		current->nr_dirtied--;
+		dec_zone_page_state(page, NR_DIRTIED);
+		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
+	}
+}
+EXPORT_SYMBOL(account_page_redirty);
+
 /*
  * When a writepage implementation decides that it doesn't want to write this
  * page for some reason, it should redirty the locked page via
@@ -1844,6 +1855,7 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers
 int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page)
 {
 	wbc->pages_skipped++;
+	account_page_redirty(page);
 	return __set_page_dirty_nobuffers(page);
 }
 EXPORT_SYMBOL(redirty_page_for_writepage);
--- linux-next.orig/include/linux/writeback.h	2011-08-29 19:14:32.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-29 19:14:38.000000000 +0800
@@ -175,6 +175,8 @@ void writeback_set_ratelimit(void);
 void tag_pages_for_writeback(struct address_space *mapping,
 			     pgoff_t start, pgoff_t end);
 
+void account_page_redirty(struct page *page);
+
 /* pdflush.c */
 extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl
 				   read-only. */



^ permalink raw reply	[flat|nested] 175+ messages in thread

* [PATCH 18/18] btrfs: fix dirtied pages accounting on sub-page writes
  2011-09-04  1:53 ` Wu Fengguang
  (?)
@ 2011-09-04  1:53   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-04  1:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Chris Mason, Wu Fengguang, Andrew Morton,
	Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen,
	Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: btrfs-account-redirty --]
[-- Type: text/plain, Size: 857 bytes --]

When doing 1KB sequential writes to the same (4KB) page,
balance_dirty_pages_ratelimited_nr() should be called once instead of 4
times; the latter makes the dirtier tasks be throttled much too heavily.

Fix it with proper de-accounting on clear_page_dirty_for_io().
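
Rough arithmetic (illustrative figures only, not measurements): being
charged 4 pages where only 1 was dirtied means whatever pause the
throttle computes per page is roughly quadrupled for the writing task.

#include <stdio.h>

int main(void)
{
        int page_size = 4096, write_size = 1024;
        int calls_per_page = page_size / write_size;    /* 4 instead of 1 */
        double pause_per_page_ms = 10.0;                /* made-up figure */

        printf("charged %dx per page: ~%.0fms pause instead of ~%.0fms\n",
               calls_per_page, calls_per_page * pause_per_page_ms,
               pause_per_page_ms);
        return 0;
}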

CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/file.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- linux-next.orig/fs/btrfs/file.c	2011-08-29 19:14:32.000000000 +0800
+++ linux-next/fs/btrfs/file.c	2011-08-29 19:14:40.000000000 +0800
@@ -1138,7 +1138,8 @@ again:
 				     GFP_NOFS);
 	}
 	for (i = 0; i < num_pages; i++) {
-		clear_page_dirty_for_io(pages[i]);
+		if (clear_page_dirty_for_io(pages[i]))
+			account_page_redirty(pages[i]);
 		set_page_extent_mapped(pages[i]);
 		WARN_ON(!PageLocked(pages[i]));
 	}



^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 02/18] writeback: dirty position control
  2011-09-04  1:53   ` Wu Fengguang
@ 2011-09-05 15:02     ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-05 15:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0              bdi_setpoint                    x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload 

In light of the global control thing already having a hard stop at
limit, what's the point of the auxiliary line? Why not simply run the
bdi control between [0.5, 1.5] and leave it at that?

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 02/18] writeback: dirty position control
  2011-09-04  1:53   ` Wu Fengguang
@ 2011-09-05 15:05     ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-05 15:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> @@ -591,6 +790,7 @@ static void global_update_bandwidth(unsi
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>                             unsigned long thresh,
> +                           unsigned long bg_thresh,
>                             unsigned long dirty,
>                             unsigned long bdi_thresh,
>                             unsigned long bdi_dirty,
> @@ -627,6 +827,7 @@ snapshot:
>  
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>                                  unsigned long thresh,
> +                                unsigned long bg_thresh,
>                                  unsigned long dirty,
>                                  unsigned long bdi_thresh,
>                                  unsigned long bdi_dirty,
> @@ -635,8 +836,8 @@ static void bdi_update_bandwidth(struct 
>         if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>                 return;
>         spin_lock(&bdi->wb.list_lock);
> -       __bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> -                              start_time);
> +       __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> +                              bdi_thresh, bdi_dirty, start_time);
>         spin_unlock(&bdi->wb.list_lock);
>  }
>  
> @@ -677,7 +878,8 @@ static void balance_dirty_pages(struct a
>                  * catch-up. This avoids (excessively) small writeouts
>                  * when the bdi limits are ramping up.
>                  */
> -               if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> +               if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> +                                                     background_thresh))
>                         break;
>  
>                 bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -721,8 +923,9 @@ static void balance_dirty_pages(struct a
>                 if (!bdi->dirty_exceeded)
>                         bdi->dirty_exceeded = 1;
>  
> -               bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> -                                    bdi_thresh, bdi_dirty, start_time);
> +               bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> +                                    nr_dirty, bdi_thresh, bdi_dirty,
> +                                    start_time);
>  
>                 /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>                  * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c   2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/fs/fs-writeback.c        2011-08-26 15:57:20.000000000 +0800
> @@ -675,7 +675,7 @@ static inline bool over_bground_thresh(v
>  static void wb_update_bandwidth(struct bdi_writeback *wb,
>                                 unsigned long start_time)
>  {
> -       __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> +       __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
>  }
>  
>  /*
> --- linux-next.orig/include/linux/writeback.h   2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/include/linux/writeback.h        2011-08-26 15:57:20.000000000 +0800
> @@ -141,6 +141,7 @@ unsigned long bdi_dirty_limit(struct bac
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>                             unsigned long thresh,
> +                           unsigned long bg_thresh,
>                             unsigned long dirty,
>                             unsigned long bdi_thresh,
>                             unsigned long bdi_dirty,


All this function signature muck doesn't seem immediately relevant to
the introduction of bdi_position_ratio() since the new function isn't
actually used.

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 02/18] writeback: dirty position control
  2011-09-05 15:02     ` Peter Zijlstra
@ 2011-09-06  2:10       ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-06  2:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Jan Kara, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Sep 05, 2011 at 11:02:59PM +0800, Peter Zijlstra wrote:
> On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> > + * (o) bdi control lines
> > + *
> > + * The control lines for the global/bdi setpoints both stretch up to @limit.
> > + * The below figure illustrates the main bdi control line with an auxiliary
> > + * line extending it to @limit.
> > + *
> > + *   o
> > + *     o
> > + *       o                                      [o] main control line
> > + *         o                                    [*] auxiliary control line
> > + *           o
> > + *             o
> > + *               o
> > + *                 o
> > + *                   o
> > + *                     o
> > + *                       o--------------------- balance point, rate scale = 1
> > + *                       | o
> > + *                       |   o
> > + *                       |     o
> > + *                       |       o
> > + *                       |         o
> > + *                       |           o
> > + *                       |             o------- connect point, rate scale = 1/2
> > + *                       |               .*
> > + *                       |                 .   *
> > + *                       |                   .      *
> > + *                       |                     .         *
> > + *                       |                       .           *
> > + *                       |                         .              *
> > + *                       |                           .                 *
> > + *  [--------------------+-----------------------------.--------------------*]
> > + *  0              bdi_setpoint                    x_intercept           limit
> > + *
> > + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> > + * normal if it starts high in situations like
> > + * - start writing to a slow SD card and a fast disk at the same time. The SD
> > + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> > + * - the bdi dirty thresh drops quickly due to change of JBOD workload 
> 
> In light of the global control thing already having a hard stop at
> limit, what's the point of the auxiliary line? Why not simply run the
> bdi control between [0.5, 1.5] and leave it at that?

Good point! It helps remove one confusing concept.

This patch reduces the auxiliary control line to a flat y=0.25 line.
The comments will be further simplified, too.

Thanks,
Fengguang
---

 mm/page-writeback.c |   17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-09-06 09:59:50.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-06 10:05:31.000000000 +0800
@@ -676,18 +676,11 @@ static unsigned long bdi_position_ratio(
 	span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16;
 	x_intercept = bdi_setpoint + span;
 
-	span >>= 1;
-	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
-		if (unlikely(bdi_dirty > limit))
-			return 0;
-		if (x_intercept < limit) {
-			x_intercept = limit;	/* auxiliary control line */
-			bdi_setpoint += span;
-			pos_ratio >>= 1;
-		}
-	}
-	pos_ratio *= x_intercept - bdi_dirty;
-	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+	if (bdi_dirty < x_intercept - span / 4) {
+		pos_ratio *= x_intercept - bdi_dirty;
+		do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+	} else
+		pos_ratio /= 4;
 
 	return pos_ratio;
 }
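
For what it's worth, a toy userspace model of the simplified control
line (arbitrary page counts, floating point instead of the kernel's
fixed-point arithmetic): the scale factor falls linearly from 1 at the
bdi setpoint and stays flat at 0.25 once bdi_dirty passes
x_intercept - span/4.

#include <stdio.h>

static double bdi_scale(double bdi_dirty, double bdi_setpoint, double span)
{
        double x_intercept = bdi_setpoint + span;

        if (bdi_dirty < x_intercept - span / 4)
                return (x_intercept - bdi_dirty) / span;        /* main line */
        return 0.25;                                            /* flat tail */
}

int main(void)
{
        double setpoint = 1000, span = 400, d;                  /* arbitrary */

        for (d = setpoint; d <= setpoint + 500; d += 100)
                printf("bdi_dirty=%4.0f scale=%.2f\n",
                       d, bdi_scale(d, setpoint, span));
        return 0;
}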

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 02/18] writeback: dirty position control
  2011-09-05 15:05     ` Peter Zijlstra
@ 2011-09-06  2:43       ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-06  2:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Jan Kara, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Sep 05, 2011 at 11:05:57PM +0800, Peter Zijlstra wrote:
> On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> > @@ -591,6 +790,7 @@ static void global_update_bandwidth(unsi
> >  
> >  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> >                             unsigned long thresh,
> > +                           unsigned long bg_thresh,
> >                             unsigned long dirty,
> >                             unsigned long bdi_thresh,
> >                             unsigned long bdi_dirty,
> > @@ -627,6 +827,7 @@ snapshot:
> >  
> >  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
> >                                  unsigned long thresh,
> > +                                unsigned long bg_thresh,
> >                                  unsigned long dirty,
> >                                  unsigned long bdi_thresh,
> >                                  unsigned long bdi_dirty,
> > @@ -635,8 +836,8 @@ static void bdi_update_bandwidth(struct 
> >         if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
> >                 return;
> >         spin_lock(&bdi->wb.list_lock);
> > -       __bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> > -                              start_time);
> > +       __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> > +                              bdi_thresh, bdi_dirty, start_time);
> >         spin_unlock(&bdi->wb.list_lock);
> >  }
> >  
> > @@ -677,7 +878,8 @@ static void balance_dirty_pages(struct a
> >                  * catch-up. This avoids (excessively) small writeouts
> >                  * when the bdi limits are ramping up.
> >                  */
> > -               if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> > +               if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> > +                                                     background_thresh))
> >                         break;
> >  
> >                 bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> > @@ -721,8 +923,9 @@ static void balance_dirty_pages(struct a
> >                 if (!bdi->dirty_exceeded)
> >                         bdi->dirty_exceeded = 1;
> >  
> > -               bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> > -                                    bdi_thresh, bdi_dirty, start_time);
> > +               bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> > +                                    nr_dirty, bdi_thresh, bdi_dirty,
> > +                                    start_time);
> >  
> >                 /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> >                  * Unstable writes are a feature of certain networked
> > --- linux-next.orig/fs/fs-writeback.c   2011-08-26 15:57:18.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c        2011-08-26 15:57:20.000000000 +0800
> > @@ -675,7 +675,7 @@ static inline bool over_bground_thresh(v
> >  static void wb_update_bandwidth(struct bdi_writeback *wb,
> >                                 unsigned long start_time)
> >  {
> > -       __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> > +       __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
> >  }
> >  
> >  /*
> > --- linux-next.orig/include/linux/writeback.h   2011-08-26 15:57:18.000000000 +0800
> > +++ linux-next/include/linux/writeback.h        2011-08-26 15:57:20.000000000 +0800
> > @@ -141,6 +141,7 @@ unsigned long bdi_dirty_limit(struct bac
> >  
> >  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> >                             unsigned long thresh,
> > +                           unsigned long bg_thresh,
> >                             unsigned long dirty,
> >                             unsigned long bdi_thresh,
> >                             unsigned long bdi_dirty,
> 
> 
> All this function signature muck doesn't seem immediately relevant to
> the introduction of bdi_position_ratio() since the new function isn't
> actually used.

Ahh, you are right.

I'll just make those chunks a standalone patch. Logically they are more
related to patch 03 "writeback: dirty rate control", however let's not
add more burden to the already complex patch 03.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 06/18] writeback: IO-less balance_dirty_pages()
  2011-09-04  1:53   ` Wu Fengguang
  (?)
@ 2011-09-06 12:13     ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-06 12:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> -static inline void task_dirties_fraction(struct task_struct *tsk,
> -               long *numerator, long *denominator)
> -{
> -       prop_fraction_single(&vm_dirties, &tsk->dirties,
> -                               numerator, denominator);
> -} 

it looks like this patch removes all users of tsk->dirties, but doesn't
in fact remove the data member from task_struct.

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-04  1:53   ` Wu Fengguang
  (?)
@ 2011-09-06 14:09     ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-06 14:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> plain text document attachment (bdi-reserve-area)
> Keep a minimal pool of dirty pages for each bdi, so that the disk IO
> queues won't underrun.
> 
> It's particularly useful for JBOD and small memory systems.
> 
> Note that this is not enough when memory is really tight (in comparison
> to write bandwidth). It may result in (pos_ratio > 1) at the setpoint
> and push the dirty pages high. This is more or less intended because the
> bdi is in the danger of IO queue underflow. However the global dirty
> pages, when pushed close to the limit, will eventually counteract our desire
> to push up the low bdi_dirty.
> 
> In low memory JBOD tests we do see disks under-utilized from time to
> time. The additional fix may be to add a BDI_async_underrun flag to
> indicate that the block write queue is running low and it's time to
> quickly fill the queue by unthrottling the tasks regardless of the
> global limit.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |   26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-26 20:12:19.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-26 20:13:21.000000000 +0800
> @@ -487,6 +487,16 @@ unsigned long bdi_dirty_limit(struct bac
>   *   0 +------------.------------------.----------------------*------------->
>   *           freerun^          setpoint^                 limit^   dirty pages
>   *
> + * (o) bdi reserve area
> + *
> + * The bdi reserve area tries to keep a reasonable number of dirty pages for
> + * preventing block queue underrun.
> + *
> + * reserve area, scale up rate as dirty pages drop low
> + * |<----------------------------------------------->|
> + * |-------------------------------------------------------*-------|----------
> + * 0                                           bdi setpoint^       ^bdi_thresh


So why not call the thing bdi freerun ?

>   * (o) bdi control lines
>   *
>   * The control lines for the global/bdi setpoints both stretch up to @limit.
> @@ -634,6 +644,22 @@ static unsigned long bdi_position_ratio(
>  	pos_ratio *= x_intercept - bdi_dirty;
>  	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
>  
> +	/*
> +	 * bdi reserve area, safeguard against dirty pool underrun and disk idle
> +	 *
> +	 * It may push the desired control point of global dirty pages higher
> +	 * than setpoint. It's not necessary in single-bdi case because a
> +	 * minimal pool of @freerun dirty pages will already be guaranteed.
> +	 */
> +	x_intercept = min(write_bw, freerun);
> +	if (bdi_dirty < x_intercept) {

So the point of the freerun point is that we never throttle before it,
so basically all the below shouldn't be needed at all, right? 

> +		if (bdi_dirty > x_intercept / 8) {
> +			pos_ratio *= x_intercept;
> +			do_div(pos_ratio, bdi_dirty);
> +		} else
> +			pos_ratio *= 8;
> +	}
> +
>  	return pos_ratio;
>  }


So why not add:

	if (likely(dirty < freerun))
		return 2;

at the start of this function and leave it at that?



^ permalink raw reply	[flat|nested] 175+ messages in thread
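
For reference, the scale-up being discussed is a simple piecewise boost:
inside the reserve area pos_ratio is multiplied by x_intercept/bdi_dirty,
saturating at 8x once bdi_dirty falls below x_intercept/8. A minimal
userspace sketch of just that scaling (not the kernel code; the sample
numbers are invented):

    #include <stdio.h>

    /*
     * Boost applied to pos_ratio inside the bdi reserve area, mirroring the
     * quoted hunk: proportional to x_intercept/bdi_dirty, capped at 8x.
     */
    static unsigned long reserve_boost(unsigned long pos_ratio,
                                       unsigned long bdi_dirty,
                                       unsigned long x_intercept)
    {
        if (bdi_dirty >= x_intercept)
            return pos_ratio;              /* outside the reserve area */
        if (bdi_dirty > x_intercept / 8)
            return pos_ratio * x_intercept / bdi_dirty;
        return pos_ratio * 8;              /* hard cap */
    }

    int main(void)
    {
        unsigned long x_intercept = 8192;  /* e.g. 32MB worth of 4k pages */
        unsigned long d;

        for (d = 512; d <= 8192; d *= 2)
            printf("bdi_dirty=%5lu  boost=%lux\n",
                   d, reserve_boost(1024, d, x_intercept) / 1024);
        return 0;
    }

The boost is bounded and multiplies the already-computed pos_ratio, which
is the main difference from the early-return alternative suggested above.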

* Re: [PATCH 11/18] block: add bdi flag to indicate risk of io queue underrun
  2011-09-04  1:53   ` Wu Fengguang
@ 2011-09-06 14:22     ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-06 14:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Tejun Heo, Jens Axboe, Li Shaohua, Andrew Morton,
	Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen,
	Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> +++ linux-next/mm/page-writeback.c      2011-08-31 14:40:58.000000000 +0800
> @@ -1067,6 +1067,9 @@ static void balance_dirty_pages(struct a
>                                      nr_dirty, bdi_thresh, bdi_dirty,
>                                      start_time);
>  
> +               if (unlikely(!dirty_exceeded && bdi_async_underrun(bdi)))
> +                       break;
> +
>                 dirty_ratelimit = bdi->dirty_ratelimit;
>                 pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
>                                                background_thresh, nr_dirty,

So dirty_exceeded looks like:


1109                 dirty_exceeded = (bdi_dirty > bdi_thresh) ||
1110                                   (nr_dirty > dirty_thresh);

Would it make sense to write it as:

	if (nr_dirty > dirty_thresh || 
	    (nr_dirty > freerun && bdi_dirty > bdi_thresh))
		dirty_exceeded = 1;

So that we don't actually throttle bdi thingies when we're still in the
freerun area?


^ permalink raw reply	[flat|nested] 175+ messages in thread
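
The difference between the existing predicate and the suggested one only
shows up while global dirty pages are still below the freerun point; a
throwaway comparison of the two forms (thresholds and numbers here are
made up for illustration):

    #include <stdio.h>

    /* predicate as currently computed in balance_dirty_pages() */
    static int exceeded_old(unsigned long nr_dirty, unsigned long dirty_thresh,
                            unsigned long bdi_dirty, unsigned long bdi_thresh)
    {
        return bdi_dirty > bdi_thresh || nr_dirty > dirty_thresh;
    }

    /* suggested form: ignore per-bdi excess while under the global freerun */
    static int exceeded_new(unsigned long nr_dirty, unsigned long dirty_thresh,
                            unsigned long freerun,
                            unsigned long bdi_dirty, unsigned long bdi_thresh)
    {
        return nr_dirty > dirty_thresh ||
               (nr_dirty > freerun && bdi_dirty > bdi_thresh);
    }

    int main(void)
    {
        /* one bdi over its own threshold, but globally still in freerun */
        unsigned long nr_dirty = 800, dirty_thresh = 4000, freerun = 1000;
        unsigned long bdi_dirty = 600, bdi_thresh = 500;

        printf("old=%d new=%d\n",
               exceeded_old(nr_dirty, dirty_thresh, bdi_dirty, bdi_thresh),
               exceeded_new(nr_dirty, dirty_thresh, freerun,
                            bdi_dirty, bdi_thresh));
        return 0;
    }

This prints old=1 new=0 for the sample numbers, i.e. the rewrite stops
counting a bdi as dirty_exceeded while the system as a whole is still in
the freerun region.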

* Re: [PATCH 13/18] writeback: limit max dirty pause time
  2011-09-04  1:53   ` Wu Fengguang
  (?)
@ 2011-09-06 14:52     ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-06 14:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:

> +static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
> +				   unsigned long bdi_dirty)
> +{
> +	unsigned long hi = ilog2(bdi->write_bandwidth);
> +	unsigned long lo = ilog2(bdi->dirty_ratelimit);
> +	unsigned long t;
> +
> +	/* target for ~10ms pause on 1-dd case */
> +	t = HZ / 50;

1k/50 usually ends up being 20 something

> +	/*
> +	 * Scale up pause time for concurrent dirtiers in order to reduce CPU
> +	 * overheads.
> +	 *
> +	 * (N * 20ms) on 2^N concurrent tasks.
> +	 */
> +	if (hi > lo)
> +		t += (hi - lo) * (20 * HZ) / 1024;
> +
> +	/*
> +	 * Limit pause time for small memory systems. If sleeping for too long
> +	 * time, a small pool of dirty/writeback pages may go empty and disk go
> +	 * idle.
> +	 *
> +	 * 1ms for every 1MB; may further consider bdi bandwidth.
> +	 */
> +	if (bdi_dirty)
> +		t = min(t, bdi_dirty >> (30 - PAGE_CACHE_SHIFT - ilog2(HZ)));

Yeah, I would add the bdi->avg_write_bandwidth term in there, 1g/s as an
avg bandwidth is just too wrong..


> +
> +	/*
> +	 * The pause time will be settled within range (max_pause/4, max_pause).
> +	 * Apply a minimal value of 4 to get a non-zero max_pause/4.
> +	 */
> +	return clamp_val(t, 4, MAX_PAUSE);

So you limit to 50ms min? That still seems fairly large. Is that because
your min sleep granularity might be something like 10ms since you're
using jiffies?

> +}



^ permalink raw reply	[flat|nested] 175+ messages in thread
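
To put numbers on the two points above (HZ/50 being ~20 jiffies rather
than 10ms, and the 1ms-per-dirty-MB cap), here is the same arithmetic
rendered as a standalone userspace program; HZ, PAGE_CACHE_SHIFT and
MAX_PAUSE are assumptions picked for the example and ilog2 is open-coded:

    #include <stdio.h>

    #define HZ                 1024       /* assumed */
    #define PAGE_CACHE_SHIFT   12         /* 4k pages */
    #define MAX_PAUSE          (HZ / 5)   /* assumed 200ms cap */

    static unsigned long ilog2(unsigned long v)
    {
        unsigned long r = 0;

        while (v >>= 1)
            r++;
        return r;
    }

    static unsigned long min_ul(unsigned long a, unsigned long b)
    {
        return a < b ? a : b;
    }

    static unsigned long clamp_ul(unsigned long v, unsigned long lo,
                                  unsigned long hi)
    {
        return v < lo ? lo : v > hi ? hi : v;
    }

    /* same shape as the quoted bdi_max_pause(); rates in pages/second */
    static unsigned long max_pause(unsigned long write_bw,
                                   unsigned long dirty_ratelimit,
                                   unsigned long bdi_dirty)
    {
        unsigned long hi = ilog2(write_bw);
        unsigned long lo = ilog2(dirty_ratelimit);
        unsigned long t = HZ / 50;         /* 1024/50 = 20 jiffies */

        if (hi > lo)                       /* +20ms per doubling of dirtiers */
            t += (hi - lo) * (20 * HZ) / 1024;

        if (bdi_dirty)                     /* roughly 1ms per dirty MB */
            t = min_ul(t, bdi_dirty >>
                          (30 - PAGE_CACHE_SHIFT - ilog2(HZ)));

        return clamp_ul(t, 4, MAX_PAUSE);
    }

    int main(void)
    {
        /* ~100MB/s disk shared by 16 dd's, 64MB of bdi dirty pages */
        printf("max_pause = %lu jiffies\n", max_pause(25600, 1600, 16384));
        return 0;
    }

With these sample numbers the 64MB of dirty pages, not the dirtier count,
ends up deciding the pause ceiling (64 jiffies).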

* Re: [PATCH 05/18] writeback: per task dirty rate limit
  2011-09-04  1:53   ` Wu Fengguang
  (?)
@ 2011-09-06 15:47     ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-06 15:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
>  /*
> + * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
> + * will look to see if it needs to start dirty throttling.
> + *
> + * If dirty_poll_interval is too low, big NUMA machines will call the expensive
> + * global_page_state() too often. So scale it near-sqrt to the safety margin
> + * (the number of pages we may dirty without exceeding the dirty limits).
> + */
> +static unsigned long dirty_poll_interval(unsigned long dirty,
> +                                        unsigned long thresh)
> +{
> +       if (thresh > dirty)
> +               return 1UL << (ilog2(thresh - dirty) >> 1);
> +
> +       return 1;
> +}

Where does that sqrt come from? 

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 14/18] writeback: control dirty pause time
  2011-09-04  1:53   ` Wu Fengguang
  (?)
@ 2011-09-06 15:51     ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-06 15:51 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> plain text document attachment (max-pause-adaption)
> The dirty pause time shall ultimately be controlled by adjusting
> nr_dirtied_pause, since there is relationship
> 
> 	pause = pages_dirtied / task_ratelimit
> 
> Assuming
> 
> 	pages_dirtied ~= nr_dirtied_pause
> 	task_ratelimit ~= dirty_ratelimit
> 
> We get
> 
> 	nr_dirtied_pause ~= dirty_ratelimit * desired_pause
> 
> Here dirty_ratelimit is preferred over task_ratelimit because it's
> more stable.
> 
> It's also important to limit possible large transitional errors:
> 
> - bw is changing quickly
> - pages_dirtied << nr_dirtied_pause on entering dirty exceeded area
> - pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a
>   separate fix, but still expect non-trivial errors)
> 
> So we end up using the above formula inside clamp_val().
> 
> The best test case for this code is to run 100 "dd bs=4M" tasks on
> btrfs and check its pause time distribution.



> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |   15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-29 19:08:43.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-29 19:08:44.000000000 +0800
> @@ -1193,7 +1193,20 @@ pause:
>  	if (!dirty_exceeded && bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
>  
> -	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
> +	if (pause == 0)
> +		current->nr_dirtied_pause =
> +				dirty_poll_interval(nr_dirty, dirty_thresh);
> +	else if (period <= max_pause / 4 &&
> +		 pages_dirtied >= current->nr_dirtied_pause)
> +		current->nr_dirtied_pause = clamp_val(
> +					dirty_ratelimit * (max_pause / 2) / HZ,
> +					pages_dirtied + pages_dirtied / 8,
> +					pages_dirtied * 4);
> +	else if (pause >= max_pause)
> +		current->nr_dirtied_pause = 1 | clamp_val(
> +					dirty_ratelimit * (max_pause * 3/8)/HZ,
> +					pages_dirtied / 4,
> +					pages_dirtied * 7/8);
>  

I very much prefer { } over multi line stmts, even if not strictly
needed.

I'm also not quite sure why pause==0 is a special case, also, do the two
other line segments connect on the transition point?

^ permalink raw reply	[flat|nested] 175+ messages in thread
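
One way to look at the last question is to evaluate the two raw (unclamped)
targets at the same operating point: the short-period branch aims the next
nr_dirtied_pause at dirty_ratelimit * max_pause/2 worth of pages, while the
pause >= max_pause branch aims at dirty_ratelimit * 3/8 * max_pause, so the
two raw targets do not coincide. A throwaway calculation with invented
numbers:

    #include <stdio.h>

    #define HZ 1000

    int main(void)
    {
        unsigned long dirty_ratelimit = 2560;  /* pages/s, ~10MB/s */
        unsigned long max_pause = 200;         /* jiffies */

        /* target used when the previous period was short */
        unsigned long up   = dirty_ratelimit * (max_pause / 2) / HZ;
        /* target used when the pause hit max_pause */
        unsigned long down = dirty_ratelimit * (max_pause * 3 / 8) / HZ;

        printf("short-period target: %lu pages (max_pause/2)\n", up);
        printf("long-pause target:   %lu pages (max_pause*3/8)\n", down);
        return 0;
    }

Whether that offset (plus the asymmetric clamps around pages_dirtied) is
deliberate hysteresis or an accidental discontinuity is exactly what the
question above is probing.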

* Re: [PATCH 15/18] writeback: charge leaked page dirties to active tasks
  2011-09-04  1:53   ` Wu Fengguang
  (?)
@ 2011-09-06 16:16     ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-06 16:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> The solution is to charge the pages dirtied by the exited gcc to the
> other random gcc/dd instances.

random dirtying task, seeing it lacks a !strcmp(t->comm, "gcc") || !
strcmp(t->comm, "dd") clause.

>  It sounds not perfect, however should
> behave good enough in practice. 

Seeing as throttled tasks aren't actually running, those that are
running are more likely to pick it up and get throttled, therefore
promoting an equal spread..?

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 17/18] writeback: fix dirtied pages accounting on redirty
  2011-09-04  1:53   ` Wu Fengguang
  (?)
@ 2011-09-06 16:18     ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-06 16:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> De-account the accumulative dirty counters on page redirty.
> 
> Page redirties (very common in ext4) will introduce mismatch between
> counters (a) and (b)
> 
> a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
> b) NR_WRITTEN, BDI_WRITTEN
> 
> This will introduce systematic errors in balanced_rate and result in
> dirty page position errors (ie. the dirty pages are no longer balanced
> around the global/bdi setpoints).
> 

So wtf is ext4 doing? Shouldn't a page stay dirty until it's written out?

That is, should we really frob around this behaviour or fix ext4 because
its on crack?

^ permalink raw reply	[flat|nested] 175+ messages in thread
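
The mismatch being corrected is straightforward to model with toy
counters: every redirty bumps the dirtied-side statistics again while the
written side only ever sees one writeout per page, so the two drift apart
unless the redirties are charged back. A toy model, not kernel code:

    #include <stdio.h>

    int main(void)
    {
        unsigned long nr_dirtied = 0, nr_written = 0;
        int i;

        /* 1000 pages, each redirtied once before its single writeout */
        for (i = 0; i < 1000; i++) {
            nr_dirtied++;   /* initial dirtying by the task */
            nr_dirtied++;   /* redirty, e.g. in ->writepage */
            nr_written++;   /* the one eventual writeout */
        }

        printf("without de-accounting: dirtied=%lu written=%lu\n",
               nr_dirtied, nr_written);

        /* with the patch, each redirty is subtracted again */
        nr_dirtied -= 1000;
        printf("with de-accounting:    dirtied=%lu written=%lu\n",
               nr_dirtied, nr_written);
        return 0;
    }

Whether ext4 should be redirtying at that rate in the first place is the
separate question raised above.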

* Re: [PATCH 02/18] writeback: dirty position control
  2011-09-04  1:53   ` Wu Fengguang
@ 2011-09-06 18:20     ` Vivek Goyal
  -1 siblings, 0 replies; 175+ messages in thread
From: Vivek Goyal @ 2011-09-06 18:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Jan Kara, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Sun, Sep 04, 2011 at 09:53:07AM +0800, Wu Fengguang wrote:

[..]
> - in memory tight systems, (1) becomes strong enough to squeeze dirty
>   pages inside the control scope
> 
> - in large memory systems where the "gravity" of (1) for pulling the
>   dirty pages to setpoint is too weak, (2) can back (1) up and drive
>   dirty pages to bdi_setpoint ~= setpoint reasonably fast.
> 
> Unfortunately in JBOD setups, the fluctuation range of bdi threshold
> is related to memory size due to the interferences between disks.  In
> this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.

Can you please elaborate a little more that what changes in JBOD setup.

> 
> Given equations
> 
>         span = x_intercept - bdi_setpoint
>         k = df/dx = - 1 / span
> 
> and the extremum values
> 
>         span = bdi_thresh
>         dx = bdi_thresh
> 
> we get
> 
>         df = - dx / span = - 1.0
> 
> That means, when bdi_dirty deviates bdi_thresh up, pos_ratio and hence
> task ratelimit will fluctuate by -100%.

I am not sure I understand the above calculation. I understood the part
that for the single bdi case, you want 12.5% variation of bdi_setpoint over
a range of write_bw [SP-write_bw/2, SP+write_bw/2]. This requirement will
lead to:

k = -1/(8*write_bw)

OR span = 8*write_bw, hence
k = -1/span

Now I missed the part about what is different in the JBOD setup and
how you come up with values for that setup so that the slope around the
bdi setpoint is sharper.

IIUC, in case of single bdi case you want to use k=-1/(8*write_bw) and in
case of JBOD you want to use k=-1/(bdi_thresh)?

That means for the single bdi case you want to trust the bdi write_bw, but
in the JBOD case you stop trusting that and just switch to bdi_thresh. Not
sure what that means.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 175+ messages in thread
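
The arithmetic Vivek is replaying can be checked directly from the linear
control line pos_ratio ~ 1 - (bdi_dirty - bdi_setpoint)/span: with
span = 8*write_bw, a deviation of one full write_bw moves pos_ratio by
12.5%, while with span shrunk to bdi_thresh a bdi_thresh-sized deviation
moves it by the full 100% quoted in the changelog. A rough sketch (the
bandwidth and threshold values are invented):

    #include <stdio.h>

    /* relative pos_ratio change for deviation dx on a line of slope -1/span */
    static double swing(double dx, double span)
    {
        return -dx / span;
    }

    int main(void)
    {
        double write_bw = 25600;    /* pages/s, ~100MB/s */
        double bdi_thresh = 50000;  /* pages */

        printf("span=8*write_bw, dx=write_bw:   %+.1f%%\n",
               100 * swing(write_bw, 8 * write_bw));
        printf("span=bdi_thresh, dx=bdi_thresh: %+.1f%%\n",
               100 * swing(bdi_thresh, bdi_thresh));
        return 0;
    }

The JBOD question then reduces to which of the two quantities, write_bw or
bdi_thresh, should set the span when bdi_thresh itself fluctuates with the
other disks.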

* Re: [PATCH 05/18] writeback: per task dirty rate limit
  2011-09-06 15:47     ` Peter Zijlstra
@ 2011-09-06 23:27       ` Jan Kara
  -1 siblings, 0 replies; 175+ messages in thread
From: Jan Kara @ 2011-09-06 23:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Tue 06-09-11 17:47:10, Peter Zijlstra wrote:
> On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> >  /*
> > + * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
> > + * will look to see if it needs to start dirty throttling.
> > + *
> > + * If dirty_poll_interval is too low, big NUMA machines will call the expensive
> > + * global_page_state() too often. So scale it near-sqrt to the safety margin
> > + * (the number of pages we may dirty without exceeding the dirty limits).
> > + */
> > +static unsigned long dirty_poll_interval(unsigned long dirty,
> > +                                        unsigned long thresh)
> > +{
> > +       if (thresh > dirty)
> > +               return 1UL << (ilog2(thresh - dirty) >> 1);
> > +
> > +       return 1;
> > +}
> 
> Where does that sqrt come from? 
  He does 2^{log_2(x)/2} which, if done in real-number arithmetic, would
result in x^{1/2}. Given the integer arithmetic, it might be twice as
small, but still it's some approximation...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 175+ messages in thread
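
Jan's reading can be checked numerically: 1UL << (ilog2(x) >> 1) is
2^(floor(log2 x)/2), an integer square root that can undershoot the exact
value by up to about a factor of two. A quick userspace comparison (build
with -lm; ilog2 is open-coded here):

    #include <stdio.h>
    #include <math.h>

    static unsigned long ilog2(unsigned long v)
    {
        unsigned long r = 0;

        while (v >>= 1)
            r++;
        return r;
    }

    /* the approximation dirty_poll_interval() applies to (thresh - dirty) */
    static unsigned long near_sqrt(unsigned long x)
    {
        return 1UL << (ilog2(x) >> 1);
    }

    int main(void)
    {
        unsigned long x;

        for (x = 10; x <= 1000000; x *= 10)
            printf("x=%8lu  near_sqrt=%5lu  sqrt=%8.1f\n",
                   x, near_sqrt(x), sqrt((double)x));
        return 0;
    }

For the throttling code the cheapness matters more than the precision:
the result only decides how many pages may be dirtied before the global
counters are looked at again.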

* Re: [PATCH 05/18] writeback: per task dirty rate limit
  2011-09-06 23:27       ` Jan Kara
@ 2011-09-06 23:34         ` Jan Kara
  -1 siblings, 0 replies; 175+ messages in thread
From: Jan Kara @ 2011-09-06 23:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Wed 07-09-11 01:27:38, Jan Kara wrote:
> On Tue 06-09-11 17:47:10, Peter Zijlstra wrote:
> > On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> > >  /*
> > > + * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
> > > + * will look to see if it needs to start dirty throttling.
> > > + *
> > > + * If dirty_poll_interval is too low, big NUMA machines will call the expensive
> > > + * global_page_state() too often. So scale it near-sqrt to the safety margin
> > > + * (the number of pages we may dirty without exceeding the dirty limits).
> > > + */
> > > +static unsigned long dirty_poll_interval(unsigned long dirty,
> > > +                                        unsigned long thresh)
> > > +{
> > > +       if (thresh > dirty)
> > > +               return 1UL << (ilog2(thresh - dirty) >> 1);
> > > +
> > > +       return 1;
> > > +}
> > 
> > Where does that sqrt come from? 
>   He does 2^{log_2(x)/2} which, if done in real numbers arithmetics, would
> result in x^{1/2}. Given the integer arithmetics, it might be twice as
> small but still it's some approximation...
  Ah, now I realize that you probably meant to ask why he uses sqrt
and not some other function... Sorry for the noise.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 15/18] writeback: charge leaked page dirties to active tasks
  2011-09-04  1:53   ` Wu Fengguang
@ 2011-09-07  0:17     ` Jan Kara
  -1 siblings, 0 replies; 175+ messages in thread
From: Jan Kara @ 2011-09-07  0:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Sun 04-09-11 09:53:20, Wu Fengguang wrote:
> It's a years long problem that a large number of short-lived dirtiers
> (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
> (eg. dd) as well as pushing the dirty pages to the global hard limit.
  I don't think it's a years-long problem. When we do per-cpu ratelimiting,
short-lived processes have the same chance (proportional to the number of
pages dirtied) of hitting balance_dirty_pages() as long-run dirtiers have.
So this problem seems to be introduced by your per-task dirty ratelimiting?
But given that you kept per-cpu ratelimiting in the end, is this still an
issue? Do you have some numbers for this patch?

								Honza

> The solution is to charge the pages dirtied by the exited gcc to the
> other random gcc/dd instances. It sounds not perfect, however should
> behave good enough in practice.
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/writeback.h |    2 ++
>  kernel/exit.c             |    2 ++
>  mm/page-writeback.c       |   12 ++++++++++++
>  3 files changed, 16 insertions(+)
> 
> --- linux-next.orig/include/linux/writeback.h	2011-08-29 19:14:22.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-08-29 19:14:32.000000000 +0800
> @@ -7,6 +7,8 @@
>  #include <linux/sched.h>
>  #include <linux/fs.h>
>  
> +DECLARE_PER_CPU(int, dirty_leaks);
> +
>  /*
>   * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
>   *
> --- linux-next.orig/mm/page-writeback.c	2011-08-29 19:14:22.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-29 19:14:32.000000000 +0800
> @@ -1237,6 +1237,7 @@ void set_page_dirty_balance(struct page 
>  }
>  
>  static DEFINE_PER_CPU(int, bdp_ratelimits);
> +DEFINE_PER_CPU(int, dirty_leaks) = 0;
>  
>  /**
>   * balance_dirty_pages_ratelimited_nr - balance dirty memory state
> @@ -1285,6 +1286,17 @@ void balance_dirty_pages_ratelimited_nr(
>  			ratelimit = 0;
>  		}
>  	}
> +	/*
> +	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
> +	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
> +	 * the dirty throttling and livelock other long-run dirtiers.
> +	 */
> +	p = &__get_cpu_var(dirty_leaks);
> +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> +		*p -= nr_pages_dirtied;
> +		current->nr_dirtied += nr_pages_dirtied;
> +	}
>  	preempt_enable();
>  
>  	if (unlikely(current->nr_dirtied >= ratelimit))
> --- linux-next.orig/kernel/exit.c	2011-08-26 16:19:27.000000000 +0800
> +++ linux-next/kernel/exit.c	2011-08-29 19:14:22.000000000 +0800
> @@ -1044,6 +1044,8 @@ NORET_TYPE void do_exit(long code)
>  	validate_creds_for_do_exit(tsk);
>  
>  	preempt_disable();
> +	if (tsk->nr_dirtied)
> +		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
>  	exit_rcu();
>  	/* causes final put_task_struct in finish_task_switch(). */
>  	tsk->state = TASK_DEAD;
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 17/18] writeback: fix dirtied pages accounting on redirty
  2011-09-06 16:18     ` Peter Zijlstra
@ 2011-09-07  0:22       ` Jan Kara
  -1 siblings, 0 replies; 175+ messages in thread
From: Jan Kara @ 2011-09-07  0:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Tue 06-09-11 18:18:56, Peter Zijlstra wrote:
> On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> > De-account the accumulative dirty counters on page redirty.
> > 
> > Page redirties (very common in ext4) will introduce mismatch between
> > counters (a) and (b)
> > 
> > a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
> > b) NR_WRITTEN, BDI_WRITTEN
> > 
> > This will introduce systematic errors in balanced_rate and result in
> > dirty page position errors (ie. the dirty pages are no longer balanced
> > around the global/bdi setpoints).
> > 
> 
> So wtf is ext4 doing? Shouldn't a page stay dirty until its written out?
> 
> That is, should we really frob around this behaviour or fix ext4 because
> its on crack?
  Fengguang, could you please verify your findings with recent kernel? I
believe ext4 got fixed in this regard some time ago already (and yes, old
delalloc writeback code in ext4 was terrible).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 05/18] writeback: per task dirty rate limit
  2011-09-06 15:47     ` Peter Zijlstra
@ 2011-09-07  1:04       ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-07  1:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Sep 06, 2011 at 11:47:10PM +0800, Peter Zijlstra wrote:
> On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> >  /*
> > + * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
> > + * will look to see if it needs to start dirty throttling.
> > + *
> > + * If dirty_poll_interval is too low, big NUMA machines will call the expensive
> > + * global_page_state() too often. So scale it near-sqrt to the safety margin
> > + * (the number of pages we may dirty without exceeding the dirty limits).
> > + */
> > +static unsigned long dirty_poll_interval(unsigned long dirty,
> > +                                        unsigned long thresh)
> > +{
> > +       if (thresh > dirty)
> > +               return 1UL << (ilog2(thresh - dirty) >> 1);
> > +
> > +       return 1;
> > +}
> 
> Where does that sqrt come from? 

Ideally if we know there are N dirtiers, it's safe to let each task
poll at (thresh-dirty)/N without exceeding the dirty limit.

However we neither know the current N, nor are we sure whether it will
rush high in the next second. So sqrt is used to tolerate a larger N on
an increased (thresh-dirty) gap:

irb> 0.upto(10) { |i| mb=2**i; pages=mb<<(20-12); printf "%4d\t%4d\n", mb, Math.sqrt(pages)}
   1      16
   2      22
   4      32
   8      45
  16      64
  32      90
  64     128
 128     181
 256     256
 512     362
1024     512

The above table means that, given a 1MB (or 1GB) gap and the dd tasks
polling balance_dirty_pages() every 16 (or 512) pages, the dirty limit
won't be exceeded as long as there are fewer than 16 (or 512) concurrent
dd's.

Note that dirty_poll_interval() will mainly be used when (dirty < freerun).
When the dirty pages are floating in range [freerun, limit],
"[PATCH 14/18] writeback: control dirty pause time" will independently
adjust tsk->nr_dirtied_pause to get suitable pause time.

So the sqrt naturally leads to lower overheads and more tolerance of N
for large memory servers, which have large (thresh-freerun) gaps.
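
As a side note, 1UL << (ilog2(x) >> 1) is only a power-of-two
approximation of sqrt(x), so it can come out up to 2x smaller than the
exact values above -- which only errs on the safe side. A minimal
user-space sketch (it emulates the kernel's ilog2() with
__builtin_clzl() and needs -lm; not kernel code) that prints both side
by side:

	/* user-space sketch only, not kernel code */
	#include <stdio.h>
	#include <math.h>

	/* stand-in for the kernel's ilog2() on unsigned long */
	static unsigned long ilog2_ul(unsigned long x)
	{
		return (8 * sizeof(unsigned long) - 1) - __builtin_clzl(x);
	}

	int main(void)
	{
		for (int i = 0; i <= 10; i++) {
			unsigned long mb = 1UL << i;
			unsigned long pages = mb << (20 - 12);	/* 4k pages */
			unsigned long approx = 1UL << (ilog2_ul(pages) >> 1);

			printf("%4lu MB	sqrt %4.0f	approx %4lu\n",
			       mb, sqrt((double)pages), approx);
		}
		return 0;
	}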

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 17/18] writeback: fix dirtied pages accounting on redirty
  2011-09-07  0:22       ` Jan Kara
  (?)
@ 2011-09-07  1:18       ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-07  1:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1279 bytes --]

On Wed, Sep 07, 2011 at 08:22:22AM +0800, Jan Kara wrote:
> On Tue 06-09-11 18:18:56, Peter Zijlstra wrote:
> > On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> > > De-account the accumulative dirty counters on page redirty.
> > > 
> > > Page redirties (very common in ext4) will introduce mismatch between
> > > counters (a) and (b)
> > > 
> > > a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
> > > b) NR_WRITTEN, BDI_WRITTEN
> > > 
> > > This will introduce systematic errors in balanced_rate and result in
> > > dirty page position errors (ie. the dirty pages are no longer balanced
> > > around the global/bdi setpoints).
> > > 
> > 
> > So wtf is ext4 doing? Shouldn't a page stay dirty until its written out?
> > 
> > That is, should we really frob around this behaviour or fix ext4 because
> > its on crack?
>   Fengguang, could you please verify your findings with recent kernel? I
> believe ext4 got fixed in this regard some time ago already (and yes, old
> delalloc writeback code in ext4 was terrible).

Jan, attached are the results for 3.1-rc4, before and after this patchset.
The test case is ext4, 1 dd, bs=4k, dirty_bytes=1GB.

Judging from global_dirtied_written.png, the dirtied/written lines are
still drifting apart from each other...

Thanks,
Fengguang

[-- Attachment #2: global_dirtied_written.png --]
[-- Type: image/png, Size: 38142 bytes --]

[-- Attachment #3: balance_dirty_pages-pause.png --]
[-- Type: image/png, Size: 22406 bytes --]

[-- Attachment #4: balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 41955 bytes --]

[-- Attachment #5: global_dirtied_written.png --]
[-- Type: image/png, Size: 38597 bytes --]

[-- Attachment #6: balance_dirty_pages-pause.png --]
[-- Type: image/png, Size: 28829 bytes --]

[-- Attachment #7: balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 71535 bytes --]

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 14/18] writeback: control dirty pause time
  2011-09-06 15:51     ` Peter Zijlstra
@ 2011-09-07  2:02       ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-07  2:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Sep 06, 2011 at 11:51:25PM +0800, Peter Zijlstra wrote:
> On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> > plain text document attachment (max-pause-adaption)
> > The dirty pause time shall ultimately be controlled by adjusting
> > nr_dirtied_pause, since there is relationship
> > 
> > 	pause = pages_dirtied / task_ratelimit
> > 
> > Assuming
> > 
> > 	pages_dirtied ~= nr_dirtied_pause
> > 	task_ratelimit ~= dirty_ratelimit
> > 
> > We get
> > 
> > 	nr_dirtied_pause ~= dirty_ratelimit * desired_pause
> > 
> > Here dirty_ratelimit is preferred over task_ratelimit because it's
> > more stable.
> > 
> > It's also important to limit possible large transitional errors:
> > 
> > - bw is changing quickly
> > - pages_dirtied << nr_dirtied_pause on entering dirty exceeded area
> > - pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a
> >   separate fix, but still expect non-trivial errors)
> > 
> > So we end up using the above formula inside clamp_val().
> > 
> > The best test case for this code is to run 100 "dd bs=4M" tasks on
> > btrfs and check its pause time distribution.
> 
> 
> 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/page-writeback.c |   15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > --- linux-next.orig/mm/page-writeback.c	2011-08-29 19:08:43.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2011-08-29 19:08:44.000000000 +0800
> > @@ -1193,7 +1193,20 @@ pause:
> >  	if (!dirty_exceeded && bdi->dirty_exceeded)
> >  		bdi->dirty_exceeded = 0;
> >  
> > -	current->nr_dirtied_pause = dirty_poll_interval(nr_dirty, dirty_thresh);
> > +	if (pause == 0)
> > +		current->nr_dirtied_pause =
> > +				dirty_poll_interval(nr_dirty, dirty_thresh);
> > +	else if (period <= max_pause / 4 &&
> > +		 pages_dirtied >= current->nr_dirtied_pause)
> > +		current->nr_dirtied_pause = clamp_val(
> > +					dirty_ratelimit * (max_pause / 2) / HZ,
> > +					pages_dirtied + pages_dirtied / 8,
> > +					pages_dirtied * 4);
> > +	else if (pause >= max_pause)
> > +		current->nr_dirtied_pause = 1 | clamp_val(
> > +					dirty_ratelimit * (max_pause * 3/8)/HZ,
> > +					pages_dirtied / 4,
> > +					pages_dirtied * 7/8);
> >  
> 
> I very much prefer { } over multi line stmts, even if not strictly
> needed.

Yeah, that does look better.

> I'm also not quite sure why pause==0 is a special case,

Good question. It covers the important case where the dirty pages are
still in the freerun area, where we don't pause at all and hence cannot
adaptively adjust current->nr_dirtied_pause based on the pause time.

I'll add a simple comment for that condition:

        if (pause == 0) { /* in freerun area */

> also, do the two other line segments connect on the transition
> point?

I guess we can simply unify the other two formulas into one:

        } else if (period <= max_pause / 4 &&
                 pages_dirtied >= current->nr_dirtied_pause) {
                current->nr_dirtied_pause = clamp_val(
==>                                     dirty_ratelimit * (max_pause / 2) / HZ,
                                        pages_dirtied + pages_dirtied / 8,
                                        pages_dirtied * 4);
        } else if (pause >= max_pause) {
                current->nr_dirtied_pause = 1 | clamp_val(
==>                                     dirty_ratelimit * (max_pause / 2) / HZ,
                                        pages_dirtied / 4,
                                        pages_dirtied - pages_dirtied / 8);
        }

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 13/18] writeback: limit max dirty pause time
  2011-09-06 14:52     ` Peter Zijlstra
@ 2011-09-07  2:35       ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-07  2:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Sep 06, 2011 at 10:52:06PM +0800, Peter Zijlstra wrote:
> On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> 
> > +static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
> > +				   unsigned long bdi_dirty)
> > +{
> > +	unsigned long hi = ilog2(bdi->write_bandwidth);
> > +	unsigned long lo = ilog2(bdi->dirty_ratelimit);
> > +	unsigned long t;
> > +
> > +	/* target for ~10ms pause on 1-dd case */
> > +	t = HZ / 50;
> 
> 1k/50 usually ends up being 20 something

Right, 20ms for max_pause. Plus, the next patch will target a
(max_pause / 2) pause time, resulting in a ~10ms typical pause time.

That does sound twisted, so I'll change the comment to "20ms max pause".

> > +	/*
> > +	 * Scale up pause time for concurrent dirtiers in order to reduce CPU
> > +	 * overheads.
> > +	 *
> > +	 * (N * 20ms) on 2^N concurrent tasks.
> > +	 */
> > +	if (hi > lo)
> > +		t += (hi - lo) * (20 * HZ) / 1024;
> > +
> > +	/*
> > +	 * Limit pause time for small memory systems. If sleeping for too long
> > +	 * time, a small pool of dirty/writeback pages may go empty and disk go
> > +	 * idle.
> > +	 *
> > +	 * 1ms for every 1MB; may further consider bdi bandwidth.
> > +	 */
> > +	if (bdi_dirty)
> > +		t = min(t, bdi_dirty >> (30 - PAGE_CACHE_SHIFT - ilog2(HZ)));
> 
> Yeah, I would add the bdi->avg_write_bandwidth term in there, 1g/s as an
> avg bandwidth is just too wrong..

Fair enough. On average, it will take

        T = bdi_dirty / write_bw

to clean all the bdi dirty pages. Applying a safety ratio of 8 and
converting to jiffies, we get

        T' = (T / 8) * HZ
           = bdi_dirty * HZ / (write_bw * 8)

        t = min(t, T')
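
For a quick sanity check of that formula (numbers purely illustrative):
with avg_write_bandwidth = 25600 pages/s (100MB/s at 4k pages),
bdi_dirty = 10240 pages (40MB) and HZ=100,

        T  = 10240 / 25600             = 0.4s
        T' = 10240 * 100 / (8 * 25600) = 5 jiffies = 50ms

so the max pause gets scaled down well below MAX_PAUSE whenever the bdi
dirty pool is small relative to the write bandwidth.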

> > +
> > +	/*
> > +	 * The pause time will be settled within range (max_pause/4, max_pause).
> > +	 * Apply a minimal value of 4 to get a non-zero max_pause/4.
> > +	 */
> > +	return clamp_val(t, 4, MAX_PAUSE);
> 
> So you limit to 50ms min? That still seems fairly large. Is that because
> your min sleep granularity might be something like 10ms since you're
> using jiffies?

With HZ=100, the minimal valid pause range will be (10ms, 40ms), with
a typical value of 20ms.

So yeah, the HZ value does impact the minimal available sleep time...

Thanks,
Fengguang
---
Subject: writeback: limit max dirty pause time
Date: Sat Jun 11 19:21:43 CST 2011

Apply two policies to scale down the max pause time for

1) small number of concurrent dirtiers
2) small memory system (comparing to storage bandwidth)

MAX_PAUSE=200ms may only be suitable for high end servers with lots of
concurrent dirtiers, where the large pause time can reduce much overheads.

Otherwise, smaller pause time is desirable whenever possible, so as to
get good responsiveness and smooth user experiences. It's actually
required for good disk utilization in the case when all the dirty pages
can be synced to disk within MAX_PAUSE=200ms.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   46 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 43 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-09-07 09:33:03.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-07 10:33:31.000000000 +0800
@@ -953,6 +953,43 @@ static unsigned long dirty_poll_interval
 	return 1;
 }
 
+static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
+				   unsigned long bdi_dirty)
+{
+	unsigned long bw = bdi->avg_write_bandwidth;
+	unsigned long hi = ilog2(bw);
+	unsigned long lo = ilog2(bdi->dirty_ratelimit);
+	unsigned long t;
+
+	/* target for 20ms max pause on 1-dd case */
+	t = HZ / 50;
+
+	/*
+	 * Scale up pause time for concurrent dirtiers in order to reduce CPU
+	 * overheads.
+	 *
+	 * (N * 20ms) on 2^N concurrent tasks.
+	 */
+	if (hi > lo)
+		t += (hi - lo) * (20 * HZ) / 1024;
+
+	/*
+	 * Limit pause time for small memory systems. If sleeping for too long
+	 * time, a small pool of dirty/writeback pages may go empty and disk go
+	 * idle.
+	 *
+	 * 8 serves as the safety ratio.
+	 */
+	if (bdi_dirty)
+		t = min(t, bdi_dirty * HZ / (8 * bw + 1));
+
+	/*
+	 * The pause time will be settled within range (max_pause/4, max_pause).
+	 * Apply a minimal value of 4 to get a non-zero max_pause/4.
+	 */
+	return clamp_val(t, 4, MAX_PAUSE);
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -973,6 +1010,7 @@ static void balance_dirty_pages(struct a
 	unsigned long bdi_thresh;
 	long period;
 	long pause = 0;
+	long max_pause;
 	bool dirty_exceeded = false;
 	unsigned long task_ratelimit;
 	unsigned long dirty_ratelimit;
@@ -1058,13 +1096,15 @@ static void balance_dirty_pages(struct a
 		if (unlikely(!dirty_exceeded && bdi_async_underrun(bdi)))
 			break;
 
+		max_pause = bdi_max_pause(bdi, bdi_dirty);
+
 		dirty_ratelimit = bdi->dirty_ratelimit;
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,
 					       bdi_thresh, bdi_dirty);
 		if (unlikely(pos_ratio == 0)) {
-			period = MAX_PAUSE;
-			pause = MAX_PAUSE;
+			period = max_pause;
+			pause = max_pause;
 			goto pause;
 		}
 		task_ratelimit = (u64)dirty_ratelimit *
@@ -1101,7 +1141,7 @@ static void balance_dirty_pages(struct a
 			pause = 1; /* avoid resetting nr_dirtied_pause below */
 			break;
 		}
-		pause = min_t(long, pause, MAX_PAUSE);
+		pause = min(pause, max_pause);
 
 pause:
 		trace_balance_dirty_pages(bdi,

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 11/18] block: add bdi flag to indicate risk of io queue underrun
  2011-09-06 14:22     ` Peter Zijlstra
@ 2011-09-07  2:37       ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-07  2:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Tejun Heo, Jens Axboe, Li, Shaohua, Andrew Morton,
	Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen,
	Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

On Tue, Sep 06, 2011 at 10:22:48PM +0800, Peter Zijlstra wrote:
> On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> > +++ linux-next/mm/page-writeback.c      2011-08-31 14:40:58.000000000 +0800
> > @@ -1067,6 +1067,9 @@ static void balance_dirty_pages(struct a
> >                                      nr_dirty, bdi_thresh, bdi_dirty,
> >                                      start_time);
> >  
> > +               if (unlikely(!dirty_exceeded && bdi_async_underrun(bdi)))
> > +                       break;
> > +
> >                 dirty_ratelimit = bdi->dirty_ratelimit;
> >                 pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> >                                                background_thresh, nr_dirty,
> 
> So dirty_exceeded looks like:
> 
> 
> 1109                 dirty_exceeded = (bdi_dirty > bdi_thresh) ||
> 1110                                   (nr_dirty > dirty_thresh);
> 
> Would it make sense to write it as:
> 
> 	if (nr_dirty > dirty_thresh || 
> 	    (nr_dirty > freerun && bdi_dirty > bdi_thresh))
> 		dirty_exceeded = 1;
> 
> So that we don't actually throttle bdi thingies when we're still in the
> freerun area?

That sounds unnecessary -- (nr_dirty > freerun) is implicitly true
because there is a big break early in the loop:

        if (nr_dirty > freerun)
                break;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 06/18] writeback: IO-less balance_dirty_pages()
  2011-09-06 12:13     ` Peter Zijlstra
@ 2011-09-07  2:46       ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-07  2:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Sep 06, 2011 at 08:13:53PM +0800, Peter Zijlstra wrote:
> On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> > -static inline void task_dirties_fraction(struct task_struct *tsk,
> > -               long *numerator, long *denominator)
> > -{
> > -       prop_fraction_single(&vm_dirties, &tsk->dirties,
> > -                               numerator, denominator);
> > -} 
> 
> it looks like this patch removes all users of tsk->dirties, but doesn't
> in fact remove the data member from task_struct.

Good catch! This incremental patch will remove all references to
vm_dirties and tsk->dirties. Hmm, it may look cleaner to make it a
standalone patch together with the chunk that removes
task_dirty_limit()/task_dirties_fraction().

Thanks,
Fengguang
---
 include/linux/sched.h |    1 -
 mm/page-writeback.c   |    9 ---------
 2 files changed, 10 deletions(-)

--- linux-next.orig/include/linux/sched.h	2011-09-07 10:42:55.000000000 +0800
+++ linux-next/include/linux/sched.h	2011-09-07 10:43:06.000000000 +0800
@@ -1520,7 +1520,6 @@ struct task_struct {
 #ifdef CONFIG_FAULT_INJECTION
 	int make_it_fail;
 #endif
-	struct prop_local_single dirties;
 	/*
 	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
 	 * balance_dirty_pages() for some dirty throttling pause
--- linux-next.orig/mm/page-writeback.c	2011-09-07 10:43:04.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-07 10:43:06.000000000 +0800
@@ -128,7 +128,6 @@ unsigned long global_dirty_limit;
  *
  */
 static struct prop_descriptor vm_completions;
-static struct prop_descriptor vm_dirties;
 
 /*
  * Work out the current dirty-memory clamping and background writeout
@@ -214,7 +213,6 @@ static void update_completion_period(voi
 {
 	int shift = calc_period_shift();
 	prop_change_shift(&vm_completions, shift);
-	prop_change_shift(&vm_dirties, shift);
 
 	writeback_set_ratelimit();
 }
@@ -294,11 +292,6 @@ void bdi_writeout_inc(struct backing_dev
 }
 EXPORT_SYMBOL_GPL(bdi_writeout_inc);
 
-void task_dirty_inc(struct task_struct *tsk)
-{
-	prop_inc_single(&vm_dirties, &tsk->dirties);
-}
-
 /*
  * Obtain an accurate fraction of the BDI's portion.
  */
@@ -1286,7 +1279,6 @@ void __init page_writeback_init(void)
 
 	shift = calc_period_shift();
 	prop_descriptor_init(&vm_completions, shift);
-	prop_descriptor_init(&vm_dirties, shift);
 }
 
 /**
@@ -1615,7 +1607,6 @@ void account_page_dirtied(struct page *p
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
-		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
 	}
 }

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 17/18] writeback: fix dirtied pages accounting on redirty
  2011-09-07  0:22       ` Jan Kara
@ 2011-09-07  6:56         ` Christoph Hellwig
  -1 siblings, 0 replies; 175+ messages in thread
From: Christoph Hellwig @ 2011-09-07  6:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Wu Fengguang, linux-fsdevel, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Wed, Sep 07, 2011 at 02:22:22AM +0200, Jan Kara wrote:
> > So wtf is ext4 doing? Shouldn't a page stay dirty until its written out?
> > 
> > That is, should we really frob around this behaviour or fix ext4 because
> > its on crack?
>   Fengguang, could you please verify your findings with recent kernel? I
> believe ext4 got fixed in this regard some time ago already (and yes, old
> delalloc writeback code in ext4 was terrible).

The pattern we do in writeback is:

in pageout / write_cache_pages:
	lock_page();
	clear_page_dirty_for_io();

in ->writepage:
	set_page_writeback();
	unlock_page();
	end_page_writeback();

So whenever ->writepage decides it doesn't want to write things back
we have to redirty pages.  This happens quite a bit in every
filesystem, but ext4 hits it a lot more often than most because it refuses
to write out delalloc pages from plain ->writepage and only allows
->writepages to do it.
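
To make the redirty path concrete, here is a simplified sketch of such
a ->writepage (the bail-out helper example_can_write_now() is
hypothetical; real filesystems check things like delalloc state or
journal credits):

	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/writeback.h>

	/* sketch only: the bail-out predicate is made up */
	static int example_writepage(struct page *page,
				     struct writeback_control *wbc)
	{
		if (!example_can_write_now(page)) {	/* hypothetical */
			/*
			 * Re-mark the page dirty; this is the redirty that
			 * lets NR_DIRTIED run ahead of NR_WRITTEN.
			 */
			redirty_page_for_writepage(wbc, page);
			unlock_page(page);
			return 0;
		}

		set_page_writeback(page);
		unlock_page(page);
		/* ... submit the actual IO for the page here ... */
		end_page_writeback(page);
		return 0;
	}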


^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 05/18] writeback: per task dirty rate limit
  2011-09-06 23:27       ` Jan Kara
  (?)
@ 2011-09-07  7:27         ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-07  7:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-09-07 at 01:27 +0200, Jan Kara wrote:
> On Tue 06-09-11 17:47:10, Peter Zijlstra wrote:
> > On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> > >  /*
> > > + * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
> > > + * will look to see if it needs to start dirty throttling.
> > > + *
> > > + * If dirty_poll_interval is too low, big NUMA machines will call the expensive
> > > + * global_page_state() too often. So scale it near-sqrt to the safety margin
> > > + * (the number of pages we may dirty without exceeding the dirty limits).
> > > + */
> > > +static unsigned long dirty_poll_interval(unsigned long dirty,
> > > +                                        unsigned long thresh)
> > > +{
> > > +       if (thresh > dirty)
> > > +               return 1UL << (ilog2(thresh - dirty) >> 1);
> > > +
> > > +       return 1;
> > > +}
> > 
> > Where does that sqrt come from? 
>   He does 2^{log_2(x)/2} which, if done in real numbers arithmetics, would
> result in x^{1/2}. Given the integer arithmetics, it might be twice as
> small but still it's some approximation...

Right, and I guess with a CPU that can do the fls it's slightly faster
than our int_sqrt().
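
To put numbers on the "twice as small": for x = 1023, ilog2(1023) = 9,
so 1UL << (9 >> 1) = 16 while the exact sqrt is ~32; for x = 1024 it is
exactly 32 again. In general the estimate stays within [sqrt(x)/2,
sqrt(x)], which only makes the poll interval more conservative.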

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 11/18] block: add bdi flag to indicate risk of io queue underrun
  2011-09-07  2:37       ` Wu Fengguang
  (?)
@ 2011-09-07  7:31         ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-07  7:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Tejun Heo, Jens Axboe, Li, Shaohua, Andrew Morton,
	Jan Kara, Christoph Hellwig, Dave Chinner, Greg Thelen,
	Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML

On Wed, 2011-09-07 at 10:37 +0800, Wu Fengguang wrote:
> On Tue, Sep 06, 2011 at 10:22:48PM +0800, Peter Zijlstra wrote:
> > On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> > > +++ linux-next/mm/page-writeback.c      2011-08-31 14:40:58.000000000 +0800
> > > @@ -1067,6 +1067,9 @@ static void balance_dirty_pages(struct a
> > >                                      nr_dirty, bdi_thresh, bdi_dirty,
> > >                                      start_time);
> > >  
> > > +               if (unlikely(!dirty_exceeded && bdi_async_underrun(bdi)))
> > > +                       break;
> > > +
> > >                 dirty_ratelimit = bdi->dirty_ratelimit;
> > >                 pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> > >                                                background_thresh, nr_dirty,
> > 
> > So dirty_exceeded looks like:
> > 
> > 
> > 1109                 dirty_exceeded = (bdi_dirty > bdi_thresh) ||
> > 1110                                   (nr_dirty > dirty_thresh);
> > 
> > Would it make sense to write it as:
> > 
> > 	if (nr_dirty > dirty_thresh || 
> > 	    (nr_dirty > freerun && bdi_dirty > bdi_thresh))
> > 		dirty_exceeded = 1;
> > 
> > So that we don't actually throttle bdi thingies when we're still in the
> > freerun area?
> 
> Sounds not necessary -- (nr_dirty > freerun) is implicitly true
> because there is a big break early in the loop:
> 
>         if (nr_dirty > freerun)
>                 break;

Ah, totally didn't see that. Thanks!

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 05/18] writeback: per task dirty rate limit
  2011-09-07  1:04       ` Wu Fengguang
  (?)
@ 2011-09-07  7:31         ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-07  7:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-09-07 at 09:04 +0800, Wu Fengguang wrote:

> So the sqrt naturally leads to less overheads and more N tolerance for
> large memory servers, which have large (thresh-freerun) gaps.

Thanks, and as you say it's an initial guess, later refined using patch
14.

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 17/18] writeback: fix dirtied pages accounting on redirty
  2011-09-07  6:56         ` Christoph Hellwig
  (?)
@ 2011-09-07  8:19           ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-07  8:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Wu Fengguang, linux-fsdevel, Andrew Morton,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-09-07 at 08:56 +0200, Christoph Hellwig wrote:
> On Wed, Sep 07, 2011 at 02:22:22AM +0200, Jan Kara wrote:
> > > So wtf is ext4 doing? Shouldn't a page stay dirty until its written out?
> > > 
> > > That is, should we really frob around this behaviour or fix ext4 because
> > > its on crack?
> >   Fengguang, could you please verify your findings with recent kernel? I
> > believe ext4 got fixed in this regard some time ago already (and yes, old
> > delalloc writeback code in ext4 was terrible).
> 
> The pattern we do in writeback is:
> 
> in pageout / write_cache_pages:
> 	lock_page();
> 	clear_page_dirty_for_io();
> 
> in ->writepage:
> 	set_page_writeback();
> 	unlock_page();
> 	end_page_writeback();
> 
> So whenever ->writepage decides it doesn't want to write things back
> we have to redirty pages.  We have this happen quite a bit in every
> filesystem, but ext4 hits it a lot more than usual because it refuses
> to write out delalloc pages from plain ->writepage and only allows
> ->writepages to do it.

Ah, right, so it is a fairly common thing and not something easily fixed
in filesystems.

Ok so I guess the patch is good. Thanks!
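
For context, the compensation boils down to something like the sketch
below (illustrative; the names and counters follow my reading of the
series rather than the final code):

void account_page_redirty(struct page *page)
{
        struct address_space *mapping = page->mapping;

        if (mapping && mapping_cap_account_dirty(mapping)) {
                current->nr_dirtied--;                  /* undo the task's dirtied count */
                dec_zone_page_state(page, NR_DIRTIED);  /* and the global counter */
                dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
        }
}

So a page that ->writepage bounces back does not inflate the dirtied
numbers that the throttling math feeds on.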

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 15/18] writeback: charge leaked page dirties to active tasks
  2011-09-06 16:16     ` Peter Zijlstra
@ 2011-09-07  9:06       ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-07  9:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Sep 07, 2011 at 12:16:36AM +0800, Peter Zijlstra wrote:
> On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> > The solution is to charge the pages dirtied by the exited gcc to the
> > other random gcc/dd instances.
> 
> random dirtying task, seeing it lacks a !strcmp(t->comm, "gcc") || !
> strcmp(t->comm, "dd") clause.

OK.

> >  It sounds not perfect, however should
> > behave good enough in practice. 
> 
> Seeing as that throttled tasks aren't actually running so those that are
> running are more likely to pick it up and get throttled, therefore
> promoting an equal spread.. ?

Exactly. Let me write that into the changelog :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 15/18] writeback: charge leaked page dirties to active tasks
  2011-09-07  0:17     ` Jan Kara
  (?)
@ 2011-09-07  9:37     ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-07  9:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 3212 bytes --]

On Wed, Sep 07, 2011 at 08:17:42AM +0800, Jan Kara wrote:
> On Sun 04-09-11 09:53:20, Wu Fengguang wrote:
> > It's a years long problem that a large number of short-lived dirtiers
> > (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
> > (eg. dd) as well as pushing the dirty pages to the global hard limit.
>   I don't think it's years long problem. When we do per-cpu ratelimiting,
> short lived processes have the same chance (proportional to the number of
> pages dirtied) of hitting balance_dirty_pages() as long-run dirtiers have.

You are right in that all tasks will hit balance_dirty_pages().
However, the caveat is that short-lived tasks will see a higher
task_bdi_thresh and hence immediately break out of the loop on the
!dirty_exceeded condition.

> So this problem seems to be introduced by your per task dirty ratelimiting?
> But given that you kept per-cpu ratelimiting in the end, is this still an
> issue?

The per-cpu ratelimit now (see "writeback: per task dirty rate limit")
only serves to back up the per-task ratelimit in case the latter fails.

In particular, the per-cpu thresh will typically be much higher than
the per-task thresh and the per-cpu counter will be reset each time
balance_dirty_pages() is called. So in practice the per-cpu thresh
will hardly trigger balance_dirty_pages(), which is exactly the
desired behavior: it will only kick in when the per-task thresh is not
working effectively due to a sudden start of too many tasks.
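
A simplified sketch of that interplay (condensed from the series; treat
the exact names as illustrative):

void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
                                        unsigned long nr_pages_dirtied)
{
        int ratelimit = current->nr_dirtied_pause;      /* per-task thresh */
        int *p;

        preempt_disable();
        p = &__get_cpu_var(bdp_ratelimits);             /* per-cpu backup counter */
        if (unlikely(current->nr_dirtied >= ratelimit))
                *p = 0;         /* about to throttle anyway: reset the backup */
        else {
                *p += nr_pages_dirtied;
                if (unlikely(*p >= ratelimit_pages)) {  /* much higher per-cpu thresh */
                        *p = 0;
                        ratelimit = 0;  /* many tasks started at once: kick in now */
                }
        }
        preempt_enable();

        if (unlikely(current->nr_dirtied >= ratelimit))
                balance_dirty_pages(mapping, current->nr_dirtied);
}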

> Do you have some numbers for this patch?

Good question! When trying to do so, I found it only works as expected
after applying this fix (well, the zero current->dirty_paused_when
issue once crossed my mind and unfortunately slipped away later...):

@@ -1103,7 +1103,10 @@ static void balance_dirty_pages(struct a
                task_ratelimit = (u64)dirty_ratelimit *
                                        pos_ratio >> RATELIMIT_CALC_SHIFT;
                period = (HZ * pages_dirtied) / (task_ratelimit | 1);
-               pause = current->dirty_paused_when + period - now;
+               if (current->dirty_paused_when)
+                       pause = current->dirty_paused_when + period - now;
+               else
+                       pause = period;
                /*
                 * For less than 1s think time (ext3/4 may block the dirtier
                 * for up to 800ms from time to time on 1-HDD; so does xfs,

The test case is to run one normal dd and two series of short lived dd's:

        dd $DD_OPTS bs=${bs:-1M} if=/dev/zero of=$mnt/zero-$i &

        (
        file=$mnt/zero-append
        touch $file
        while test -f $file
        do
                dd $DD_OPTS oflag=append conv=notrunc if=/dev/zero of=$file bs=8k count=8
        done
        ) &

        (
        file=$mnt/zero-append-2
        touch $file
        while test -f $file
        do
                dd $DD_OPTS oflag=append conv=notrunc if=/dev/zero of=$file bs=8k count=8
        done
        ) &

The attached figures show the behaviors before/after the patch.  Without
the patch, the dirty pages hit @limit and bdi->dirty_ratelimit hits 1;
with the patch, the position & rate balances are effectively restored.

Thanks,
Fengguang

[-- Attachment #2: balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 59240 bytes --]

[-- Attachment #3: balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 76771 bytes --]

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 05/18] writeback: per task dirty rate limit
  2011-09-07  7:31         ` Peter Zijlstra
@ 2011-09-07 11:00           ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-07 11:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Sep 07, 2011 at 03:31:56PM +0800, Peter Zijlstra wrote:
> On Wed, 2011-09-07 at 09:04 +0800, Wu Fengguang wrote:
> 
> > So the sqrt naturally leads to less overheads and more N tolerance for
> > large memory servers, which have large (thresh-freerun) gaps.
> 
> Thanks, and as you say its an initial guess, later refined using patch
> 14.

Yes, exactly.

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-06 14:09     ` Peter Zijlstra
@ 2011-09-07 12:31       ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-07 12:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Sep 06, 2011 at 10:09:39PM +0800, Peter Zijlstra wrote:
> On Sun, 2011-09-04 at 09:53 +0800, Wu Fengguang wrote:
> > plain text document attachment (bdi-reserve-area)
> > Keep a minimal pool of dirty pages for each bdi, so that the disk IO
> > queues won't underrun.
> > 
> > It's particularly useful for JBOD and small memory system.
> > 
> > Note that this is not enough when memory is really tight (in comparison
> > to write bandwidth). It may result in (pos_ratio > 1) at the setpoint
> > and push the dirty pages high. This is more or less intended because the
> > bdi is in the danger of IO queue underflow. However the global dirty
> > pages, when pushed close to limit, will eventually conteract our desire
> > to push up the low bdi_dirty.
> > 
> > In low memory JBOD tests we do see disks under-utilized from time to
> > time. The additional fix may be to add a BDI_async_underrun flag to
> > indicate that the block write queue is running low and it's time to
> > quickly fill the queue by unthrottling the tasks regardless of the
> > global limit.
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/page-writeback.c |   26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> > 
> > --- linux-next.orig/mm/page-writeback.c	2011-08-26 20:12:19.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2011-08-26 20:13:21.000000000 +0800
> > @@ -487,6 +487,16 @@ unsigned long bdi_dirty_limit(struct bac
> >   *   0 +------------.------------------.----------------------*------------->
> >   *           freerun^          setpoint^                 limit^   dirty pages
> >   *
> > + * (o) bdi reserve area
> > + *
> > + * The bdi reserve area tries to keep a reasonable number of dirty pages for
> > + * preventing block queue underrun.
> > + *
> > + * reserve area, scale up rate as dirty pages drop low
> > + * |<----------------------------------------------->|
> > + * |-------------------------------------------------------*-------|----------
> > + * 0                                           bdi setpoint^       ^bdi_thresh
> 
> 
> So why not call the thing bdi freerun ?

Yeah, I remember trying the "bdi freerun" concept in some earlier
version. The main problem is that, compared to the global freerun, it
risks exceeding the dirty limit. So if we are to do any bdi freerun
area, it must be kept as small as possible.

Or we can do a conditional bdi freerun area as long as we stay under the
global dirty limit. Something like

        bdi_freerun = min(limit - nr_dirty, write_bw + 4MBps) / 8

I'll do some experiments and check how well it performs in JBOD setups.
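
Spelled out literally (a sketch only; MB_TO_PAGES() is a made-up
MB-to-pages conversion helper):

static unsigned long bdi_freerun(unsigned long limit,
                                 unsigned long nr_dirty,
                                 unsigned long write_bw)
{
        unsigned long headroom = limit > nr_dirty ? limit - nr_dirty : 0;

        return min(headroom, write_bw + MB_TO_PAGES(4)) / 8;
}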

It's not likely to obsolete the bdi underrun flag, because the latter
helps a lot in the 1-disk dirty_bytes=1MB case, where the bdi freerun
should be a NOP as there is already the global freerun.

> >   * (o) bdi control lines
> >   *
> >   * The control lines for the global/bdi setpoints both stretch up to @limit.
> > @@ -634,6 +644,22 @@ static unsigned long bdi_position_ratio(
> >  	pos_ratio *= x_intercept - bdi_dirty;
> >  	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> >  
> > +	/*
> > +	 * bdi reserve area, safeguard against dirty pool underrun and disk idle
> > +	 *
> > +	 * It may push the desired control point of global dirty pages higher
> > +	 * than setpoint. It's not necessary in single-bdi case because a
> > +	 * minimal pool of @freerun dirty pages will already be guaranteed.
> > +	 */
> > +	x_intercept = min(write_bw, freerun);
> > +	if (bdi_dirty < x_intercept) {
> 
> So the point of the freerun point is that we never throttle before it,
> so basically all the below shouldn't be needed at all, right? 

Yes!

> > +		if (bdi_dirty > x_intercept / 8) {
> > +			pos_ratio *= x_intercept;
> > +			do_div(pos_ratio, bdi_dirty);
> > +		} else
> > +			pos_ratio *= 8;
> > +	}
> > +
> >  	return pos_ratio;
> >  }
> 
> 
> So why not add:
> 
> 	if (likely(dirty < freerun))
> 		return 2;
> 
> at the start of this function and leave it at that?

Because we already have

        if (nr_dirty < freerun)
                break;

in the main balance_dirty_pages() loop ;)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 00/18] IO-less dirty throttling v11
  2011-09-04  1:53 ` Wu Fengguang
@ 2011-09-07 13:32   ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-07 13:32 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML, Trond Myklebust

> Finally, the complete IO-less balance_dirty_pages(). NFS is observed to perform
> better or worse depending on the memory size. Otherwise the added patches can
> address all known regressions.

I find that the NFS performance regressions on large memory systems can
be fixed by this patch. It tries to make the progress smoother by
reasonably reducing the commit size.

Thanks,
Fengguang
---
Subject: nfs: limit the commit size to reduce fluctuations
Date: Thu Dec 16 13:22:43 CST 2010

Limit the commit size to half the dirty control scope, so that the
arrival of one commit will not knock the overall dirty pages off the
scope.

Also limit the commit size to one second worth of data. This will
obviously help make the pipeline run more smoothly.

Also change "<=" to "<": if an inode has only one dirty page in the end,
it should be committed. I wonder why the "<=" didn't cause a bug...

CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

After patch, there are still drop offs from the control scope,

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/balance_dirty_pages-pages.png

due to bursty arrival of commits:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/nfs-commit.png

--- linux-next.orig/fs/nfs/write.c	2011-09-07 21:29:15.000000000 +0800
+++ linux-next/fs/nfs/write.c	2011-09-07 21:29:32.000000000 +0800
@@ -1543,10 +1543,14 @@ static int nfs_commit_unstable_pages(str
 	int ret = 0;
 
 	if (wbc->sync_mode == WB_SYNC_NONE) {
+		unsigned long bw = MIN_WRITEBACK_PAGES +
+			NFS_SERVER(inode)->backing_dev_info.avg_write_bandwidth;
+
 		/* Don't commit yet if this is a non-blocking flush and there
-		 * are a lot of outstanding writes for this mapping.
+		 * are a lot of outstanding writes for this mapping, until
+		 * collected enough pages to commit.
 		 */
-		if (nfsi->ncommit <= (nfsi->npages >> 1))
+		if (nfsi->ncommit < min(nfsi->npages / DIRTY_SCOPE, bw))
 			goto out_mark_dirty;
 
 		/* don't wait for the COMMIT response */

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 17/18] writeback: fix dirtied pages accounting on redirty
  2011-09-07  8:19           ` Peter Zijlstra
@ 2011-09-07 16:42             ` Jan Kara
  -1 siblings, 0 replies; 175+ messages in thread
From: Jan Kara @ 2011-09-07 16:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Hellwig, Jan Kara, Wu Fengguang, linux-fsdevel,
	Andrew Morton, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Wed 07-09-11 10:19:47, Peter Zijlstra wrote:
> On Wed, 2011-09-07 at 08:56 +0200, Christoph Hellwig wrote:
> > On Wed, Sep 07, 2011 at 02:22:22AM +0200, Jan Kara wrote:
> > > > So wtf is ext4 doing? Shouldn't a page stay dirty until its written out?
> > > > 
> > > > That is, should we really frob around this behaviour or fix ext4 because
> > > > its on crack?
> > >   Fengguang, could you please verify your findings with recent kernel? I
> > > believe ext4 got fixed in this regard some time ago already (and yes, old
> > > delalloc writeback code in ext4 was terrible).
> > 
> > The pattern we do in writeback is:
> > 
> > in pageout / write_cache_pages:
> > 	lock_page();
> > 	clear_page_dirty_for_io();
> > 
> > in ->writepage:
> > 	set_page_writeback();
> > 	unlock_page();
> > 	end_page_writeback();
> > 
> > So whenever ->writepage decides it doesn't want to write things back
> > we have to redirty pages.  We have this happen quite a bit in every
> > filesystem, but ext4 hits it a lot more than usual because it refuses
> > to write out delalloc pages from plain ->writepage and only allows
> > ->writepages to do it.
> 
> Ah, right, so it is a fairly common thing and not something easily fixed
> in filesystems.
  Well, it depends on what you call common - usually, ->writepage is called
from kswapd which shouldn't be common compared to writeback from a flusher
thread. But now I've realized that JBD2 also calls ->writepage to fulfill
data=ordered mode guarantees and that's what causes most of the redirtying
of pages on ext4. That's going away eventually but it will take some time. So
for now writeback has to handle redirtying...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 17/18] writeback: fix dirtied pages accounting on redirty
  2011-09-07 16:42             ` Jan Kara
@ 2011-09-07 16:46               ` Christoph Hellwig
  -1 siblings, 0 replies; 175+ messages in thread
From: Christoph Hellwig @ 2011-09-07 16:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Christoph Hellwig, Wu Fengguang, linux-fsdevel,
	Andrew Morton, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Wed, Sep 07, 2011 at 06:42:16PM +0200, Jan Kara wrote:
>   Well, it depends on what you call common - usually, ->writepage is called
> from kswapd which shouldn't be common compared to writeback from a flusher
> thread. But now I've realized that JBD2 also calls ->writepage to fulfill
> data=ordered mode guarantees and that's what causes most of redirtying of
> pages on ext4. That's going away eventually but it will take some time. So
> for now writeback has to handle redirtying...

Under the "right" loads it may also happen for xfs because we can't
take lock non-blockingly in the fluser thread for example.


^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 00/18] IO-less dirty throttling v11
  2011-09-07 13:32   ` Wu Fengguang
@ 2011-09-07 19:14     ` Trond Myklebust
  -1 siblings, 0 replies; 175+ messages in thread
From: Trond Myklebust @ 2011-09-07 19:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Wed, 2011-09-07 at 21:32 +0800, Wu Fengguang wrote: 
> > Finally, the complete IO-less balance_dirty_pages(). NFS is observed to perform
> > better or worse depending on the memory size. Otherwise the added patches can
> > address all known regressions.
> 
> I find that the NFS performance regressions on large memory system can
> be fixed by this patch. It tries to make the progress more smooth by
> reasonably reducing the commit size.
> 
> Thanks,
> Fengguang
> ---
> Subject: nfs: limit the commit size to reduce fluctuations
> Date: Thu Dec 16 13:22:43 CST 2010
> 
> Limit the commit size to half the dirty control scope, so that the
> arrival of one commit will not knock the overall dirty pages off the
> scope.
> 
> Also limit the commit size to one second worth of data. This will
> obviously help make the pipeline run more smoothly.
> 
> Also change "<=" to "<": if an inode has only one dirty page in the end,
> it should be committed. I wonder why the "<=" didn't cause a bug...
> 
> CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/nfs/write.c |    8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> After patch, there are still drop offs from the control scope,
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/balance_dirty_pages-pages.png
> 
> due to bursty arrival of commits:
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/nfs-commit.png
> 
> --- linux-next.orig/fs/nfs/write.c	2011-09-07 21:29:15.000000000 +0800
> +++ linux-next/fs/nfs/write.c	2011-09-07 21:29:32.000000000 +0800
> @@ -1543,10 +1543,14 @@ static int nfs_commit_unstable_pages(str
>  	int ret = 0;
>  
>  	if (wbc->sync_mode == WB_SYNC_NONE) {
> +		unsigned long bw = MIN_WRITEBACK_PAGES +
> +			NFS_SERVER(inode)->backing_dev_info.avg_write_bandwidth;
> +
>  		/* Don't commit yet if this is a non-blocking flush and there
> -		 * are a lot of outstanding writes for this mapping.
> +		 * are a lot of outstanding writes for this mapping, until
> +		 * collected enough pages to commit.
>  		 */
> -		if (nfsi->ncommit <= (nfsi->npages >> 1))
> +		if (nfsi->ncommit < min(nfsi->npages / DIRTY_SCOPE, bw))
>  			goto out_mark_dirty;
>  
>  		/* don't wait for the COMMIT response */

So what goes into the 'avg_write_bandwidth' variable that makes it a
good measure above (why 1 second of data instead of 10 seconds or
1ms, ...)? What is the 'DIRTY_SCOPE' value?

IOW: what new black magic are we introducing above and why is it so
obviously better than what we have (yes, I see you have graphs, but that
is just measuring _one_ NFS setup and workload).

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 02/18] writeback: dirty position control
  2011-09-06 18:20     ` Vivek Goyal
@ 2011-09-08  2:53       ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-08  2:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Jan Kara, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, Sep 07, 2011 at 02:20:34AM +0800, Vivek Goyal wrote:
> On Sun, Sep 04, 2011 at 09:53:07AM +0800, Wu Fengguang wrote:
> 
> [..]
> > - in memory tight systems, (1) becomes strong enough to squeeze dirty
> >   pages inside the control scope
> > 
> > - in large memory systems where the "gravity" of (1) for pulling the
> >   dirty pages to setpoint is too weak, (2) can back (1) up and drive
> >   dirty pages to bdi_setpoint ~= setpoint reasonably fast.
> > 
> > Unfortunately in JBOD setups, the fluctuation range of bdi threshold
> > is related to memory size due to the interferences between disks.  In
> > this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.
> 
> Can you please elaborate a little more that what changes in JBOD setup.
> 
> > 
> > Given equations
> > 
> >         span = x_intercept - bdi_setpoint
> >         k = df/dx = - 1 / span
> > 
> > and the extremum values
> > 
> >         span = bdi_thresh
> >         dx = bdi_thresh
> > 
> > we get
> > 
> >         df = - dx / span = - 1.0
> > 
> > That means, when bdi_dirty deviates bdi_thresh up, pos_ratio and hence
> > task ratelimit will fluctuate by -100%.
> 
> I am not sure I understand above calculation. I understood the part that
> for single bdi case, you want 12.5% varation of bdi_setpoint over a
> range of write_bw [SP-write_bw/2, SP+write_bw/2]. This requirement will
> lead to.
> 
> k = -1/8*write_bw
> 
> OR span = 8*write_bw, hence
> k= -1/span

That's right.

> Now I missed the part that what is different in case of JBOD setup and
> how do you come up with values for that setup so that slope of bdi
> setpoint is sharper.
> 
> IIUC, in case of single bdi case you want to use k=-1/(8*write_bw) and in
> case of JBOD you want to use k=-1/(bdi_thresh)?

Yeah.

> That means for single bdi case you want to trust bdi, write_bw but in
> case of JBOD you stop trusting that and just switch to bdi_thresh. Not
> sure what does it mean.

The main differences are,

1) in the JBOD setup, bdi_thresh is fluctuating; in the single bdi case,
   bdi_thresh is pretty stable. The fluctuating bdi_thresh means that
   even if bdi_dirty is stable, dx=(bdi_dirty-bdi_setpoint) will be
   fluctuating a lot. And the dx range is no longer bounded by the
   bdi write bandwidth, but proportional to bdi_thresh.

2) for the single bdi case, bdi_dirty=nr_dirty is controlled by both
   the memory based global control line and the bandwidth based bdi
   control line. However for JBOD, we want to keep bdi_dirty reasonably
   close to bdi_setpoint, and the global control line is not going
   to help us directly. The bdi_thresh based slope can better serve
   this purpose than the write bandwidth (see the sketch below).
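
To make the two slopes concrete, here is the linear part of the bdi
control line again, annotated (the statements are as quoted from the
patch; the span choice is what's being discussed):

        /*
         * k = d(pos_ratio)/d(bdi_dirty) = -1 / span
         *
         * single bdi: span ~ 8 * write_bw, so a write_bw sized deviation
         *             moves pos_ratio by ~12.5%
         * JBOD:       span weighted towards bdi_thresh, so even a
         *             bdi_thresh sized deviation (which does happen as
         *             bdi_thresh itself fluctuates) moves pos_ratio by
         *             at most ~100%
         */
        x_intercept = bdi_setpoint + span;
        pos_ratio *= x_intercept - bdi_dirty;
        do_div(pos_ratio, x_intercept - bdi_setpoint + 1);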

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 17/18] writeback: fix dirtied pages accounting on redirty
  2011-09-07 16:46               ` Christoph Hellwig
@ 2011-09-08  8:51                 ` Steven Whitehouse
  -1 siblings, 0 replies; 175+ messages in thread
From: Steven Whitehouse @ 2011-09-08  8:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, linux-fsdevel,
	Andrew Morton, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

Hi,

On Wed, 2011-09-07 at 18:46 +0200, Christoph Hellwig wrote:
> On Wed, Sep 07, 2011 at 06:42:16PM +0200, Jan Kara wrote:
> >   Well, it depends on what you call common - usually, ->writepage is called
> > from kswapd which shouldn't be common compared to writeback from a flusher
> > thread. But now I've realized that JBD2 also calls ->writepage to fulfill
> > data=ordered mode guarantees and that's what causes most of redirtying of
> > pages on ext4. That's going away eventually but it will take some time. So
> > for now writeback has to handle redirtying...
> 
> Under the "right" loads it may also happen for xfs because we can't
> take the lock non-blockingly in the flusher thread, for example.
> 

GFS2 uses this trick for journaled data pages - the lock ordering is
transaction lock before page lock, so we cannot handle pages which are
already locked before they are handed to the fs if a transaction is
required. So we have our own ->writepages which gets the locks in the
correct order, and ->writepage will simply redirty the page if it would
have required a transaction in order to write out the page.
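
For anyone unfamiliar with the pattern, here is a minimal fragment
sketching such a ->writepage (NOT the actual GFS2 code;
needs_transaction() and example_get_block() are hypothetical stand-ins
for the fs specific pieces):

static int example_writepage(struct page *page, struct writeback_control *wbc)
{
	/* can't start a transaction with the page already locked */
	if (needs_transaction(page)) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}
	return block_write_full_page(page, example_get_block, wbc);
}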

Steve.



^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-07 12:31       ` Wu Fengguang
  (?)
@ 2011-09-12 10:19         ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-12 10:19 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-09-07 at 20:31 +0800, Wu Fengguang wrote:
> > > +   x_intercept = min(write_bw, freerun);
> > > +   if (bdi_dirty < x_intercept) {
> > 
> > So the point of the freerun point is that we never throttle before it,
> > so basically all the below shouldn't be needed at all, right? 
> 
> Yes!
> 
> > > +           if (bdi_dirty > x_intercept / 8) {
> > > +                   pos_ratio *= x_intercept;
> > > +                   do_div(pos_ratio, bdi_dirty);
> > > +           } else
> > > +                   pos_ratio *= 8;
> > > +   }
> > > +
> > >     return pos_ratio;
> > >  }

Does that mean we can remove this whole block?

> > 
> > So why not add:
> > 
> >       if (likely(dirty < freerun))
> >               return 2;
> > 
> > at the start of this function and leave it at that?
> 
> Because we already have
> 
>         if (nr_dirty < freerun)
>                 break;
> 
> in the main balance_dirty_pages() loop ;)

Bah! I keep missing that ;-)

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 13/18] writeback: limit max dirty pause time
  2011-09-07  2:35       ` Wu Fengguang
  (?)
@ 2011-09-12 10:22         ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-12 10:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-09-07 at 10:35 +0800, Wu Fengguang wrote:
> So yeah, the HZ value does impact the minimal available sleep time...

There's always schedule_hrtimeout() and we could trivially add a
io_schedule_hrtimeout() variant if you need it.
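
A hypothetical sketch of what such a variant might look like, modeled
on io_schedule_timeout() (this helper does not actually exist; the real
thing would also need the per-runqueue nr_iowait accounting and the
blk_flush_plug() call that io_schedule_timeout() does):

int io_schedule_hrtimeout(ktime_t *expires, const enum hrtimer_mode mode)
{
	int ret;

	delayacct_blkio_start();
	current->in_iowait = 1;
	ret = schedule_hrtimeout(expires, mode);	/* hrtimer resolution sleep */
	current->in_iowait = 0;
	delayacct_blkio_end();
	return ret;
}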

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 14/18] writeback: control dirty pause time
  2011-09-07  2:02       ` Wu Fengguang
  (?)
@ 2011-09-12 10:28         ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-12 10:28 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-09-07 at 10:02 +0800, Wu Fengguang wrote:
> > also, do the two other line segments connect on the transition
> > point?
> 
> I guess we can simply unify the other two formulas into one:
> 
>         } else if (period <= max_pause / 4 &&
>                  pages_dirtied >= current->nr_dirtied_pause) {
>                 current->nr_dirtied_pause = clamp_val(
> ==>                                     dirty_ratelimit * (max_pause / 2) / HZ,
>                                         pages_dirtied + pages_dirtied / 8,
>                                         pages_dirtied * 4);
>         } else if (pause >= max_pause) {
>                 current->nr_dirtied_pause = 1 | clamp_val(
> ==>                                     dirty_ratelimit * (max_pause / 2) / HZ,
>                                         pages_dirtied / 4,
>                                         pages_dirtied - pages_dirtied / 8);
>         } 


There's still the clamping, that combined with the various conditionals
make it very hard to tell if the functions are connected or jump around.
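
A small userspace toy (assumed values; clamp_val() re-implemented as a
macro) that evaluates both branches for the same rate-based target makes
the discontinuity visible: the clamp windows [9/8..4] and [1/4..7/8] of
pages_dirtied are disjoint, so near the transition the result jumps
rather than connects:

#include <stdio.h>

#define clamp_val(val, lo, hi) \
	((val) < (lo) ? (lo) : ((val) > (hi) ? (hi) : (val)))

int main(void)
{
	long pages_dirtied = 64;		/* assumed */
	long targets[] = { 16, 64, 512 };	/* assumed dirty_ratelimit * (max_pause/2) / HZ */
	int i;

	for (i = 0; i < 3; i++) {
		long t = targets[i];
		long raised  = clamp_val(t, pages_dirtied + pages_dirtied / 8,
					 pages_dirtied * 4);
		long lowered = 1 | clamp_val(t, pages_dirtied / 4,
					     pages_dirtied - pages_dirtied / 8);
		printf("target %4ld: raise branch -> %4ld, lower branch -> %4ld\n",
		       t, raised, lowered);
	}
	return 0;
}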

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-12 10:19         ` Peter Zijlstra
  (?)
  (?)
@ 2011-09-18 14:17         ` Wu Fengguang
  2011-09-18 14:37             ` Wu Fengguang
  -1 siblings, 1 reply; 175+ messages in thread
From: Wu Fengguang @ 2011-09-18 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 6558 bytes --]

On Mon, Sep 12, 2011 at 06:19:38PM +0800, Peter Zijlstra wrote:
> On Wed, 2011-09-07 at 20:31 +0800, Wu Fengguang wrote:
> > > > +   x_intercept = min(write_bw, freerun);
> > > > +   if (bdi_dirty < x_intercept) {
> > > 
> > > So the point of the freerun point is that we never throttle before it,
> > > so basically all the below shouldn't be needed at all, right? 
> > 
> > Yes!
> > 
> > > > +           if (bdi_dirty > x_intercept / 8) {
> > > > +                   pos_ratio *= x_intercept;
> > > > +                   do_div(pos_ratio, bdi_dirty);
> > > > +           } else
> > > > +                   pos_ratio *= 8;
> > > > +   }
> > > > +
> > > >     return pos_ratio;
> > > >  }
> 
> Does that mean we can remove this whole block?

Right, if the bdi freerun concept is proved to work fine.

Unfortunately I find it mostly yields lower performance than the bdi
reserve area. The patch is attached. If you would like me to try other
patches, I can easily kick off new tests and redo the comparison.

Here are the nr_written numbers over various JBOD test cases;
the larger, the better:

bdi-reserve     bdi-freerun    diff    case
---------------------------------------------------------------------------------------
38375271        31553807      -17.8%	JBOD-10HDD-6G/xfs-100dd-1M-16p-5895M-20
30478879        28631491       -6.1%	JBOD-10HDD-6G/xfs-10dd-1M-16p-5895M-20
29735407        28871956       -2.9%	JBOD-10HDD-6G/xfs-1dd-1M-16p-5895M-20
30850350        28344165       -8.1%	JBOD-10HDD-6G/xfs-2dd-1M-16p-5895M-20
17706200        16174684       -8.6%	JBOD-10HDD-thresh=100M/xfs-100dd-1M-16p-5895M-100M
23374918        14376942      -38.5%	JBOD-10HDD-thresh=100M/xfs-10dd-1M-16p-5895M-100M
20659278        19640375       -4.9%	JBOD-10HDD-thresh=100M/xfs-1dd-1M-16p-5895M-100M
22517497        14552321      -35.4%	JBOD-10HDD-thresh=100M/xfs-2dd-1M-16p-5895M-100M
68287850        61078553      -10.6%	JBOD-10HDD-thresh=2G/xfs-100dd-1M-16p-5895M-2048M
33835247        32018425       -5.4%	JBOD-10HDD-thresh=2G/xfs-10dd-1M-16p-5895M-2048M
30187817        29942083       -0.8%	JBOD-10HDD-thresh=2G/xfs-1dd-1M-16p-5895M-2048M
30563144        30204022       -1.2%	JBOD-10HDD-thresh=2G/xfs-2dd-1M-16p-5895M-2048M
34476862        34645398       +0.5%	JBOD-10HDD-thresh=4G/xfs-10dd-1M-16p-5895M-4096M
30326479        30097263       -0.8%	JBOD-10HDD-thresh=4G/xfs-1dd-1M-16p-5895M-4096M
30446767        30339683       -0.4%	JBOD-10HDD-thresh=4G/xfs-2dd-1M-16p-5895M-4096M
40793956        45936678      +12.6%	JBOD-10HDD-thresh=800M/xfs-100dd-1M-16p-5895M-800M
27481305        24867282       -9.5%	JBOD-10HDD-thresh=800M/xfs-10dd-1M-16p-5895M-800M
25651257        22507406      -12.3%	JBOD-10HDD-thresh=800M/xfs-1dd-1M-16p-5895M-800M
19849350        21298787       +7.3%	JBOD-10HDD-thresh=800M/xfs-2dd-1M-16p-5895M-800M

raw data by "grep":

JBOD-10HDD-6G/xfs-100dd-1M-16p-5895M-20:10-3.1.0-rc4+/vmstat-end:nr_written 38375271
JBOD-10HDD-6G/xfs-10dd-1M-16p-5895M-20:10-3.1.0-rc4+/vmstat-end:nr_written 30478879
JBOD-10HDD-6G/xfs-1dd-1M-16p-5895M-20:10-3.1.0-rc4+/vmstat-end:nr_written 29735407
JBOD-10HDD-6G/xfs-2dd-1M-16p-5895M-20:10-3.1.0-rc4+/vmstat-end:nr_written 30850350
JBOD-10HDD-thresh=100M/xfs-100dd-1M-16p-5895M-100M:10-3.1.0-rc4+/vmstat-end:nr_written 17706200
JBOD-10HDD-thresh=100M/xfs-10dd-1M-16p-5895M-100M:10-3.1.0-rc4+/vmstat-end:nr_written 23374918
JBOD-10HDD-thresh=100M/xfs-1dd-1M-16p-5895M-100M:10-3.1.0-rc4+/vmstat-end:nr_written 20659278
JBOD-10HDD-thresh=100M/xfs-2dd-1M-16p-5895M-100M:10-3.1.0-rc4+/vmstat-end:nr_written 22517497
JBOD-10HDD-thresh=2G/xfs-100dd-1M-16p-5895M-2048M:10-3.1.0-rc4+/vmstat-end:nr_written 68287850
JBOD-10HDD-thresh=2G/xfs-10dd-1M-16p-5895M-2048M:10-3.1.0-rc4+/vmstat-end:nr_written 33835247
JBOD-10HDD-thresh=2G/xfs-1dd-1M-16p-5895M-2048M:10-3.1.0-rc4+/vmstat-end:nr_written 30187817
JBOD-10HDD-thresh=2G/xfs-2dd-1M-16p-5895M-2048M:10-3.1.0-rc4+/vmstat-end:nr_written 30563144
JBOD-10HDD-thresh=4G/xfs-10dd-1M-16p-5895M-4096M:10-3.1.0-rc4+/vmstat-end:nr_written 34476862
JBOD-10HDD-thresh=4G/xfs-1dd-1M-16p-5895M-4096M:10-3.1.0-rc4+/vmstat-end:nr_written 30326479
JBOD-10HDD-thresh=4G/xfs-2dd-1M-16p-5895M-4096M:10-3.1.0-rc4+/vmstat-end:nr_written 30446767
JBOD-10HDD-thresh=800M/xfs-100dd-1M-16p-5895M-800M:10-3.1.0-rc4+/vmstat-end:nr_written 40793956
JBOD-10HDD-thresh=800M/xfs-10dd-1M-16p-5895M-800M:10-3.1.0-rc4+/vmstat-end:nr_written 27481305
JBOD-10HDD-thresh=800M/xfs-1dd-1M-16p-5895M-800M:10-3.1.0-rc4+/vmstat-end:nr_written 25651257
JBOD-10HDD-thresh=800M/xfs-2dd-1M-16p-5895M-800M:10-3.1.0-rc4+/vmstat-end:nr_written 19849350

JBOD-10HDD-6G/xfs-100dd-1M-16p-5895M-20:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 31553807
JBOD-10HDD-6G/xfs-10dd-1M-16p-5895M-20:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 28631491
JBOD-10HDD-6G/xfs-1dd-1M-16p-5895M-20:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 28871956
JBOD-10HDD-6G/xfs-2dd-1M-16p-5895M-20:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 28344165
JBOD-10HDD-thresh=100M/xfs-100dd-1M-16p-5895M-100M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 16174684
JBOD-10HDD-thresh=100M/xfs-10dd-1M-16p-5895M-100M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 14376942
JBOD-10HDD-thresh=100M/xfs-1dd-1M-16p-5895M-100M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 19640375
JBOD-10HDD-thresh=100M/xfs-2dd-1M-16p-5895M-100M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 14552321
JBOD-10HDD-thresh=2G/xfs-100dd-1M-16p-5895M-2048M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 61078553
JBOD-10HDD-thresh=2G/xfs-10dd-1M-16p-5895M-2048M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 32018425
JBOD-10HDD-thresh=2G/xfs-1dd-1M-16p-5895M-2048M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 29942083
JBOD-10HDD-thresh=2G/xfs-2dd-1M-16p-5895M-2048M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 30204022
JBOD-10HDD-thresh=4G/xfs-10dd-1M-16p-5895M-4096M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 34645398
JBOD-10HDD-thresh=4G/xfs-1dd-1M-16p-5895M-4096M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 30097263
JBOD-10HDD-thresh=4G/xfs-2dd-1M-16p-5895M-4096M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 30339683
JBOD-10HDD-thresh=800M/xfs-100dd-1M-16p-5895M-800M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 45936678
JBOD-10HDD-thresh=800M/xfs-10dd-1M-16p-5895M-800M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 24867282
JBOD-10HDD-thresh=800M/xfs-1dd-1M-16p-5895M-800M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 22507406
JBOD-10HDD-thresh=800M/xfs-2dd-1M-16p-5895M-800M:10-3.1.0-rc4-bdi-freerun+/vmstat-end:nr_written 21298787

[-- Attachment #2: bdi-freerun --]
[-- Type: text/plain, Size: 1488 bytes --]

Subject: 
Date: Wed Sep 14 22:57:43 CST 2011


Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-09-14 22:50:33.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-14 22:58:15.000000000 +0800
@@ -614,22 +614,6 @@ static unsigned long bdi_position_ratio(
 	} else
 		pos_ratio /= 4;
 
-	/*
-	 * bdi reserve area, safeguard against dirty pool underrun and disk idle
-	 *
-	 * It may push the desired control point of global dirty pages higher
-	 * than setpoint. It's not necessary in single-bdi case because a
-	 * minimal pool of @freerun dirty pages will already be guaranteed.
-	 */
-	x_intercept = min(write_bw, freerun);
-	if (bdi_dirty < x_intercept) {
-		if (bdi_dirty > x_intercept / 8) {
-			pos_ratio *= x_intercept;
-			do_div(pos_ratio, bdi_dirty);
-		} else
-			pos_ratio *= 8;
-	}
-
 	return pos_ratio;
 }
 
@@ -1089,8 +1073,14 @@ static void balance_dirty_pages(struct a
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
-		if (unlikely(!dirty_exceeded && bdi_async_underrun(bdi)))
-			break;
+		freerun = min(bdi->avg_write_bandwidth + MIN_WRITEBACK_PAGES,
+			      global_dirty_limit - nr_dirty) / 8;
+		if (!dirty_exceeded) {
+			if (unlikely(bdi_dirty < freerun))
+				break;
+			if (unlikely(bdi_async_underrun(bdi)))
+				break;
+		}
 
 		max_pause = bdi_max_pause(bdi, bdi_dirty);
 

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 13/18] writeback: limit max dirty pause time
  2011-09-12 10:22         ` Peter Zijlstra
@ 2011-09-18 14:23           ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-18 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Sep 12, 2011 at 06:22:31PM +0800, Peter Zijlstra wrote:
> On Wed, 2011-09-07 at 10:35 +0800, Wu Fengguang wrote:
> > So yeah, the HZ value does impact the minimal available sleep time...
> 
> There's always schedule_hrtimeout() and we could trivially add a
> io_schedule_hrtimeout() variant if you need it.

Yeah, we could do that when we get done with the basic functions :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-18 14:17         ` Wu Fengguang
@ 2011-09-18 14:37             ` Wu Fengguang
  0 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-18 14:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sun, Sep 18, 2011 at 10:17:05PM +0800, Wu Fengguang wrote:
> On Mon, Sep 12, 2011 at 06:19:38PM +0800, Peter Zijlstra wrote:
> > On Wed, 2011-09-07 at 20:31 +0800, Wu Fengguang wrote:
> > > > > +   x_intercept = min(write_bw, freerun);
> > > > > +   if (bdi_dirty < x_intercept) {
> > > > 
> > > > So the point of the freerun point is that we never throttle before it,
> > > > so basically all the below shouldn't be needed at all, right? 
> > > 
> > > Yes!
> > > 
> > > > > +           if (bdi_dirty > x_intercept / 8) {
> > > > > +                   pos_ratio *= x_intercept;
> > > > > +                   do_div(pos_ratio, bdi_dirty);
> > > > > +           } else
> > > > > +                   pos_ratio *= 8;
> > > > > +   }
> > > > > +
> > > > >     return pos_ratio;
> > > > >  }
> > 
> > Does that mean we can remove this whole block?
> 
> Right, if the bdi freerun concept is proved to work fine.
> 
> Unfortunately I find it mostly yields lower performance than bdi
> reserve area. Patch is attached. If you would like me try other
> patches, I can easily kick off new tests and redo the comparison.
> 
> Here is the nr_written numbers over various JBOD test cases,
> the larger, the better:
> 
> bdi-reserve     bdi-freerun    diff    case
> ---------------------------------------------------------------------------------------
> 38375271        31553807      -17.8%	JBOD-10HDD-6G/xfs-100dd-1M-16p-5895M-20
> 30478879        28631491       -6.1%	JBOD-10HDD-6G/xfs-10dd-1M-16p-5895M-20
> 29735407        28871956       -2.9%	JBOD-10HDD-6G/xfs-1dd-1M-16p-5895M-20
> 30850350        28344165       -8.1%	JBOD-10HDD-6G/xfs-2dd-1M-16p-5895M-20
> 17706200        16174684       -8.6%	JBOD-10HDD-thresh=100M/xfs-100dd-1M-16p-5895M-100M
> 23374918        14376942      -38.5%	JBOD-10HDD-thresh=100M/xfs-10dd-1M-16p-5895M-100M
> 20659278        19640375       -4.9%	JBOD-10HDD-thresh=100M/xfs-1dd-1M-16p-5895M-100M
> 22517497        14552321      -35.4%	JBOD-10HDD-thresh=100M/xfs-2dd-1M-16p-5895M-100M
> 68287850        61078553      -10.6%	JBOD-10HDD-thresh=2G/xfs-100dd-1M-16p-5895M-2048M
> 33835247        32018425       -5.4%	JBOD-10HDD-thresh=2G/xfs-10dd-1M-16p-5895M-2048M
> 30187817        29942083       -0.8%	JBOD-10HDD-thresh=2G/xfs-1dd-1M-16p-5895M-2048M
> 30563144        30204022       -1.2%	JBOD-10HDD-thresh=2G/xfs-2dd-1M-16p-5895M-2048M
> 34476862        34645398       +0.5%	JBOD-10HDD-thresh=4G/xfs-10dd-1M-16p-5895M-4096M
> 30326479        30097263       -0.8%	JBOD-10HDD-thresh=4G/xfs-1dd-1M-16p-5895M-4096M
> 30446767        30339683       -0.4%	JBOD-10HDD-thresh=4G/xfs-2dd-1M-16p-5895M-4096M
> 40793956        45936678      +12.6%	JBOD-10HDD-thresh=800M/xfs-100dd-1M-16p-5895M-800M
> 27481305        24867282       -9.5%	JBOD-10HDD-thresh=800M/xfs-10dd-1M-16p-5895M-800M
> 25651257        22507406      -12.3%	JBOD-10HDD-thresh=800M/xfs-1dd-1M-16p-5895M-800M
> 19849350        21298787       +7.3%	JBOD-10HDD-thresh=800M/xfs-2dd-1M-16p-5895M-800M

BTW, I also compared the IO-less patchset and the vanilla kernel's
JBOD performance. Basically, the performance is slightly improved
under large memory, and reduced a lot in small memory servers.

 vanillla IO-less  
--------------------------------------------------------------------------------
 31189025 34476862      +10.5%  JBOD-10HDD-thresh=4G/xfs-10dd-1M-16p-5895M-4096M
 30441974 30326479       -0.4%  JBOD-10HDD-thresh=4G/xfs-1dd-1M-16p-5895M-4096M
 30484578 30446767       -0.1%  JBOD-10HDD-thresh=4G/xfs-2dd-1M-16p-5895M-4096M

 68532421 68287850       -0.4%  JBOD-10HDD-thresh=2G/xfs-100dd-1M-16p-5895M-2048M
 31606793 33835247       +7.1%  JBOD-10HDD-thresh=2G/xfs-10dd-1M-16p-5895M-2048M
 30404955 30187817       -0.7%  JBOD-10HDD-thresh=2G/xfs-1dd-1M-16p-5895M-2048M
 30425591 30563144       +0.5%  JBOD-10HDD-thresh=2G/xfs-2dd-1M-16p-5895M-2048M

 40451069 38375271       -5.1%  JBOD-10HDD-6G/xfs-100dd-1M-16p-5895M-20
 30903629 30478879       -1.4%  JBOD-10HDD-6G/xfs-10dd-1M-16p-5895M-20
 30113560 29735407       -1.3%  JBOD-10HDD-6G/xfs-1dd-1M-16p-5895M-20
 30181418 30850350       +2.2%  JBOD-10HDD-6G/xfs-2dd-1M-16p-5895M-20

 46067335 40793956      -11.4%  JBOD-10HDD-thresh=800M/xfs-100dd-1M-16p-5895M-800M
 30425063 27481305       -9.7%  JBOD-10HDD-thresh=800M/xfs-10dd-1M-16p-5895M-800M
 28437929 25651257       -9.8%  JBOD-10HDD-thresh=800M/xfs-1dd-1M-16p-5895M-800M
 29409406 19849350      -32.5%  JBOD-10HDD-thresh=800M/xfs-2dd-1M-16p-5895M-800M

 26508063 17706200      -33.2%  JBOD-10HDD-thresh=100M/xfs-100dd-1M-16p-5895M-100M
 23767810 23374918       -1.7%  JBOD-10HDD-thresh=100M/xfs-10dd-1M-16p-5895M-100M
 28032891 20659278      -26.3%  JBOD-10HDD-thresh=100M/xfs-1dd-1M-16p-5895M-100M
 26049973 22517497      -13.6%  JBOD-10HDD-thresh=100M/xfs-2dd-1M-16p-5895M-100M

There are still some itches in JBOD..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-18 14:37             ` Wu Fengguang
@ 2011-09-18 14:47               ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-18 14:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> BTW, I also compared the IO-less patchset and the vanilla kernel's
> JBOD performance. Basically, the performance is lightly improved
> under large memory, and reduced a lot in small memory servers.
> 
>  vanillla IO-less  
> --------------------------------------------------------------------------------
[...]
>  26508063 17706200      -33.2%  JBOD-10HDD-thresh=100M/xfs-100dd-1M-16p-5895M-100M
>  23767810 23374918       -1.7%  JBOD-10HDD-thresh=100M/xfs-10dd-1M-16p-5895M-100M
>  28032891 20659278      -26.3%  JBOD-10HDD-thresh=100M/xfs-1dd-1M-16p-5895M-100M
>  26049973 22517497      -13.6%  JBOD-10HDD-thresh=100M/xfs-2dd-1M-16p-5895M-100M
> 
> There are still some itches in JBOD..

OK, in the dirty_bytes=100M case, I find that the bdi threshold _and_
the writeout bandwidth may drop close to 0 for long periods. This change
may avoid one bdi being stuck:

        /*
         * bdi reserve area, safeguard against dirty pool underrun and disk idle
         *
         * It may push the desired control point of global dirty pages higher
         * than setpoint. It's not necessary in single-bdi case because a
         * minimal pool of @freerun dirty pages will already be guaranteed.
         */
-       x_intercept = min(write_bw, freerun);
+       x_intercept = min(write_bw + MIN_WRITEBACK_PAGES, freerun);
        if (bdi_dirty < x_intercept) {
                if (bdi_dirty > x_intercept / 8) {
                        pos_ratio *= x_intercept;
                        do_div(pos_ratio, bdi_dirty);
                } else
                        pos_ratio *= 8;
        }
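
A quick userspace sanity check of the effect (made-up numbers;
MIN_WRITEBACK_PAGES is assumed here to be 1024 pages, roughly 4MB,
which may not match the exact kernel value): once write_bw collapses,
the old x_intercept collapses with it and the boost never fires, while
the new one keeps a floor.

#include <stdio.h>

#define MIN_WRITEBACK_PAGES	1024			/* assumed value */
#define min(a, b)		((a) < (b) ? (a) : (b))

int main(void)
{
	unsigned long write_bw  = 8;	/* nearly idle disk, pages per period */
	unsigned long freerun   = 3200;	/* assumed global freerun */
	unsigned long bdi_dirty = 200;

	unsigned long old_x = min(write_bw, freerun);
	unsigned long new_x = min(write_bw + MIN_WRITEBACK_PAGES, freerun);

	printf("old x_intercept %4lu: boost %s\n", old_x,
	       bdi_dirty < old_x ? "fires" : "never fires");
	printf("new x_intercept %4lu: boost %s\n", new_x,
	       bdi_dirty < new_x ? "fires" : "never fires");
	return 0;
}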

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-18 14:47               ` Wu Fengguang
  (?)
@ 2011-09-28 14:02               ` Wu Fengguang
  2011-09-28 14:50                   ` Peter Zijlstra
  2011-09-29 12:15                 ` Wu Fengguang
  -1 siblings, 2 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-28 14:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 7040 bytes --]

Hi Peter,

On Sun, Sep 18, 2011 at 10:47:51PM +0800, Wu Fengguang wrote:
> > BTW, I also compared the IO-less patchset and the vanilla kernel's
> > JBOD performance. Basically, the performance is lightly improved
> > under large memory, and reduced a lot in small memory servers.
> > 
> >  vanillla IO-less  
> > --------------------------------------------------------------------------------
> [...]
> >  26508063 17706200      -33.2%  JBOD-10HDD-thresh=100M/xfs-100dd-1M-16p-5895M-100M
> >  23767810 23374918       -1.7%  JBOD-10HDD-thresh=100M/xfs-10dd-1M-16p-5895M-100M
> >  28032891 20659278      -26.3%  JBOD-10HDD-thresh=100M/xfs-1dd-1M-16p-5895M-100M
> >  26049973 22517497      -13.6%  JBOD-10HDD-thresh=100M/xfs-2dd-1M-16p-5895M-100M
> > 
> > There are still some itches in JBOD..
> 
> OK, in the dirty_bytes=100M case, I find that the bdi threshold _and_
> writeout bandwidth may drop close to 0 in long periods. This change
> may avoid one bdi being stuck:
> 
>         /*
>          * bdi reserve area, safeguard against dirty pool underrun and disk idle
>          *
>          * It may push the desired control point of global dirty pages higher
>          * than setpoint. It's not necessary in single-bdi case because a
>          * minimal pool of @freerun dirty pages will already be guaranteed.
>          */
> -       x_intercept = min(write_bw, freerun);
> +       x_intercept = min(write_bw + MIN_WRITEBACK_PAGES, freerun);

After lots of experiments, I ended up with this bdi reserve point

+       x_intercept = bdi_thresh / 2 + MIN_WRITEBACK_PAGES;

together with this chunk to avoid a bdi getting stuck in the bdi_thresh=0 state:

@@ -590,6 +590,7 @@ static unsigned long bdi_position_ratio(
         */
        if (unlikely(bdi_thresh > thresh))
                bdi_thresh = thresh;
+       bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
        /*
         * scale global setpoint to bdi's:
         *      bdi_setpoint = setpoint * bdi_thresh / thresh

The above changes are good enough to keep a reasonable amount of bdi
dirty pages, so the bdi underrun flag ("[PATCH 11/18] block: add bdi
flag to indicate risk of io queue underrun") is dropped.
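
A minimal userspace sketch (made-up page counts, not kernel code) of
how the two tweaks combine for a starved bdi, with MIN_WRITEBACK_PAGES
assumed to be 1024 pages:

#include <stdio.h>

#define MIN_WRITEBACK_PAGES	1024	/* assumed, roughly 4MB worth of pages */

int main(void)
{
	unsigned long limit = 25600;	/* 100MB dirty limit in 4k pages */
	unsigned long dirty = 20000;	/* global dirty pages */
	unsigned long bdi_thresh = 100;	/* starved bdi, nearly zero share */
	unsigned long bdi_dirty = 300;
	unsigned long x_intercept;
	double pos_ratio = 1.0;

	/* the floor keeps a starved bdi from sitting at bdi_thresh ~= 0 */
	if (bdi_thresh < (limit - dirty) / 8)
		bdi_thresh = (limit - dirty) / 8;

	/* bdi reserve area: boost pos_ratio when bdi_dirty runs low */
	x_intercept = bdi_thresh / 2 + MIN_WRITEBACK_PAGES;
	if (bdi_dirty < x_intercept) {
		if (bdi_dirty > x_intercept / 8)
			pos_ratio *= (double)x_intercept / bdi_dirty;
		else
			pos_ratio *= 8;
	}

	printf("bdi_thresh after floor: %lu pages\n", bdi_thresh);
	printf("x_intercept %lu, pos_ratio boost %.2fx\n", x_intercept, pos_ratio);
	return 0;
}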

I also tried various bdi freerun patches; however, the results are not
satisfactory. Basically the bdi reserve area approach (this patch)
yields noticeably smoother/more resilient behavior than the
freerun/underrun approaches. I noticed that the bdi underrun flag
could lead to a sudden surge of dirty pages (especially if not
safeguarded by the dirty_exceeded condition) in a very small
window..

To dig performance increases/drops out of the large number of test
results, I wrote a convenient script (attached) to compare the
vmstat:nr_written numbers between 2+ sets of test runs. It helped a lot
in fine-tuning the parameters for the different cases.

The current JBOD performance numbers are encouraging:

$ ./compare.rb JBOD*/*-vanilla+ JBOD*/*-bgthresh3+
      3.1.0-rc4-vanilla+      3.1.0-rc4-bgthresh3+
------------------------  ------------------------
                52934365        +3.2%     54643527  JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X
                45488896       +18.2%     53785605  JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X
                47217534       +12.2%     53001031  JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X
                32286924       +25.4%     40492312  JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
                38676965       +14.2%     44177606  JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
                59662173       +11.1%     66269621  JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X
                57510438        +2.3%     58855181  JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X
                63691922       +64.0%    104460352  JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
                51978567       +16.0%     60298210  JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
                47641062        +6.4%     50681038  JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X

The common single disk cases also see good numbers except for slight
drops in the dirty_bytes=100MB case:

$ ./compare.rb thresh*/*vanilla+ thresh*/*bgthresh3+
      3.1.0-rc4-vanilla+      3.1.0-rc4-bgthresh3+  
------------------------  ------------------------  
                 4092719        -2.5%      3988742  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                 4956323        -4.0%      4758884  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                 4640118        -0.4%      4621240  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                 3545136        -3.5%      3420717  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                 4399437        -0.9%      4361830  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                 4100655        -3.3%      3964043  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                 4780624        -0.1%      4776216  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                 4904565        +0.0%      4905293  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                 3578539        +9.1%      3903390  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                 4029890        +0.8%      4063717  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                 2449031       +20.0%      2937926  thresh=1M/ext4-10dd-4k-8p-4096M-1M:10-X
                 4161896        +7.5%      4472552  thresh=1M/ext4-1dd-4k-8p-4096M-1M:10-X
                 3437787       +18.8%      4085707  thresh=1M/ext4-2dd-4k-8p-4096M-1M:10-X
                 1921914       +14.8%      2206897  thresh=1M/xfs-10dd-4k-8p-4096M-1M:10-X
                 2537481       +65.8%      4207336  thresh=1M/xfs-1dd-4k-8p-4096M-1M:10-X
                 3329176       +12.3%      3739888  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
                 4587856        +1.8%      4672501  thresh=400M-300M/ext4-10dd-4k-8p-4096M-400M:300M-X
                 4883525        +0.0%      4884957  thresh=400M-300M/ext4-1dd-4k-8p-4096M-400M:300M-X
                 4799105        +2.3%      4907525  thresh=400M-300M/ext4-2dd-4k-8p-4096M-400M:300M-X
                 3931315        +3.0%      4048277  thresh=400M-300M/xfs-10dd-4k-8p-4096M-400M:300M-X
                 4238389        +3.9%      4401927  thresh=400M-300M/xfs-1dd-4k-8p-4096M-400M:300M-X
                 4032798        +2.3%      4123838  thresh=400M-300M/xfs-2dd-4k-8p-4096M-400M:300M-X
                 2425253       +35.2%      3279302  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                 4728506        +2.2%      4834878  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                 2782860       +62.1%      4511120  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                 1966133       +24.3%      2443874  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                 4238402        +1.7%      4308416  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                 3299446       +13.3%      3739810  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X

Thanks,
Fengguang

[-- Attachment #2: compare.rb --]
[-- Type: application/x-ruby, Size: 2755 bytes --]

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-28 14:02               ` Wu Fengguang
@ 2011-09-28 14:50                   ` Peter Zijlstra
  2011-09-29 12:15                 ` Wu Fengguang
  1 sibling, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-28 14:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-09-28 at 22:02 +0800, Wu Fengguang wrote:

/me attempts to swap back neurons related to writeback

> After lots of experiments, I end up with this bdi reserve point
> 
> +       x_intercept = bdi_thresh / 2 + MIN_WRITEBACK_PAGES;
> 
> together with this chunk to avoid a bdi stuck in bdi_thresh=0 state:
> 
> @@ -590,6 +590,7 @@ static unsigned long bdi_position_ratio(
>          */
>         if (unlikely(bdi_thresh > thresh))
>                 bdi_thresh = thresh;
> +       bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
>         /*
>          * scale global setpoint to bdi's:
>          *      bdi_setpoint = setpoint * bdi_thresh / thresh

So you cap bdi_thresh at a minimum of (limit-dirty)/8, which can be
pretty close to 0 if we have a spike in dirty or a negative spike in
writeout bandwidth (sudden seeks or whatnot).


> The above changes are good enough to keep reasonable amount of bdi
> dirty pages, so the bdi underrun flag ("[PATCH 11/18] block: add bdi
> flag to indicate risk of io queue underrun") is dropped.

That sounds like goodness ;-)

> I also tried various bdi freerun patches, however the results are not
> satisfactory. Basically the bdi reserve area approach (this patch)
> yields noticeably more smooth/resilient behavior than the
> freerun/underrun approaches. I noticed that the bdi underrun flag
> could lead to sudden surge of dirty pages (especially if not
> safeguarded by the dirty_exceeded condition) in the very small
> window.. 

OK, so let me try and parse this magic:

+       x_intercept = bdi_thresh / 2 + MIN_WRITEBACK_PAGES;
+       if (bdi_dirty < x_intercept) {
+               if (bdi_dirty > x_intercept / 8) {
+                       pos_ratio *= x_intercept;
+                       do_div(pos_ratio, bdi_dirty);
+               } else
+                       pos_ratio *= 8;
+       }

So we set our target some place north of MIN_WRITEBACK_PAGES: if we're
short we multiply by a factor of x_intercept/bdi_dirty.

Now, since bdi_dirty < x_intercept, this is > 1 and thus we promote more
dirties.

Additionally we don't let the factor get larger than 8 to avoid silly
large fluctuations (8 already seems quite generous to me).


Now I guess the only problem is when nr_bdi * MIN_WRITEBACK_PAGES ~
limit, at which point things go pear shaped.
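
A back-of-envelope check of that concern (assumed numbers;
MIN_WRITEBACK_PAGES taken as 1024 pages): with many bdis and a small
dirty limit, the per-bdi reserve areas alone can approach the global
limit, so the boost and the global control line start fighting.

#include <stdio.h>

#define MIN_WRITEBACK_PAGES	1024	/* assumed, roughly 4MB worth of pages */

int main(void)
{
	unsigned long limit = 25600;			/* 100MB in 4k pages */
	unsigned long nr_bdi = 10;			/* e.g. the JBOD-10HDD setup */
	unsigned long bdi_thresh = limit / nr_bdi;	/* idealized even split */
	unsigned long x_intercept = bdi_thresh / 2 + MIN_WRITEBACK_PAGES;

	printf("per-bdi reserve (x_intercept): %lu pages\n", x_intercept);
	printf("all bdis combined: %lu of %lu pages (%.0f%% of the limit)\n",
	       nr_bdi * x_intercept, limit,
	       100.0 * nr_bdi * x_intercept / limit);
	return 0;
}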

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 00/18] IO-less dirty throttling v11
  2011-09-04  1:53 ` Wu Fengguang
@ 2011-09-28 14:58   ` Christoph Hellwig
  -1 siblings, 0 replies; 175+ messages in thread
From: Christoph Hellwig @ 2011-09-28 14:58 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Sun, Sep 04, 2011 at 09:53:05AM +0800, Wu Fengguang wrote:
> Hi,
> 
> Finally, the complete IO-less balance_dirty_pages(). NFS is observed to perform
> better or worse depending on the memory size. Otherwise the added patches can
> address all known regressions.
> 
>         git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v11
> 	(to be updated; currently it contains a pre-release v11)

Fengguang,

is there any chance we could start doing just the IO-less
balance_dirty_pages, but not all the subtle other changes?  I.e. are
there any known issues that make things work worse than current mainline
if we only put in patches 1 to 6?  We're getting close to another merge
window, and we're still busy trying to figure out all the details of
the bandwidth estimation.  I think we'd have a much more robust tree
if we'd first only merge the infrastructure (IO-less
balance_dirty_pages()) and then work on the algorithms separately.


^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-28 14:50                   ` Peter Zijlstra
@ 2011-09-29  3:32                     ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-29  3:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Sep 28, 2011 at 10:50:35PM +0800, Peter Zijlstra wrote:
> On Wed, 2011-09-28 at 22:02 +0800, Wu Fengguang wrote:
> 
> /me attempts to swap back neurons related to writeback
> 
> > After lots of experiments, I end up with this bdi reserve point
> > 
> > +       x_intercept = bdi_thresh / 2 + MIN_WRITEBACK_PAGES;
> > 
> > together with this chunk to avoid a bdi stuck in bdi_thresh=0 state:
> > 
> > @@ -590,6 +590,7 @@ static unsigned long bdi_position_ratio(
> >          */
> >         if (unlikely(bdi_thresh > thresh))
> >                 bdi_thresh = thresh;
> > +       bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
> >         /*
> >          * scale global setpoint to bdi's:
> >          *      bdi_setpoint = setpoint * bdi_thresh / thresh
> 
> So you cap bdi_thresh at a minimum of (limit-dirty)/8 which can be
> pretty close to 0 if we have a spike in dirty or a negative spike in
> writeout bandwidth (sudden seeks or whatnot).

That's right. However, to bring bdi_thresh out of the close-to-zero
state, it's only required that (limit-dirty)/8 be reasonably large for
the _majority_ of the time (e.g. with a 2GB limit and dirty hovering
around 1.5GB, the floor is 64MB worth of pages), which is not a
problem for servers unless something goes wrong.

> 
> > The above changes are good enough to keep reasonable amount of bdi
> > dirty pages, so the bdi underrun flag ("[PATCH 11/18] block: add bdi
> > flag to indicate risk of io queue underrun") is dropped.
> 
> That sounds like goodness ;-)

Yeah!

> > I also tried various bdi freerun patches, however the results are not
> > satisfactory. Basically the bdi reserve area approach (this patch)
> > yields noticeably more smooth/resilient behavior than the
> > freerun/underrun approaches. I noticed that the bdi underrun flag
> > could lead to sudden surge of dirty pages (especially if not
> > safeguarded by the dirty_exceeded condition) in the very small
> > window.. 
> 
> OK, so let me try and parse this magic:
> 
> +       x_intercept = bdi_thresh / 2 + MIN_WRITEBACK_PAGES;
> +       if (bdi_dirty < x_intercept) {
> +               if (bdi_dirty > x_intercept / 8) {
> +                       pos_ratio *= x_intercept;
> +                       do_div(pos_ratio, bdi_dirty);
> +               } else
> +                       pos_ratio *= 8;
> +       }
> 
> So we set our target some place north of MIN_WRITEBACK_PAGES: if we're
> short we add a factor of: x_intercept/bdi_dirty. 
> 
> Now, since bdi_dirty < x_intercept, this is > 1 and thus we promote more
> dirties.

That's right.

> Additionally we don't let the factor get larger than 8 to avoid silly
> large fluctuations (8 already seems quite generous to me).

I actually increased 8 to 128 and still think it is safe: for the
promotion ratio to be 128, bdi_dirty should be around bdi_thresh/2/128
(or 0.4% of bdi_thresh). However large the promotion ratio is, it won't
be more radical than some bdi freerun threshold.

In the tests, what the bdi reserve area protects is mainly small memory
systems (small dirty threshold compared to writeout bandwidth), where
an IO completion can bring down bdi_dirty considerably (relatively
speaking) and we really need to ramp it up fast at that point to keep
the disk fed.

> Now I guess the only problem is when nr_bdi * MIN_WRITEBACK_PAGES ~
> limit, at which point things go pear shaped.

Yes. In that case the global @dirty will always be driven up to @limit.
Once @dirty drops reasonably below it, whichever bdi's task wakes up
first will take the chance to fill the gap, which is not fair for bdis
of different speeds.

Let me retry the thresh=1M,10M test cases without MIN_WRITEBACK_PAGES.
Hopefully removing it won't impact performance much.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 00/18] IO-less dirty throttling v11
  2011-09-28 14:58   ` Christoph Hellwig
@ 2011-09-29  4:11     ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-29  4:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

Hi Christoph,

On Wed, Sep 28, 2011 at 10:58:57PM +0800, Christoph Hellwig wrote:
> On Sun, Sep 04, 2011 at 09:53:05AM +0800, Wu Fengguang wrote:
> > Hi,
> > 
> > Finally, the complete IO-less balance_dirty_pages(). NFS is observed to perform
> > better or worse depending on the memory size. Otherwise the added patches can
> > address all known regressions.
> > 
> >         git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v11
> > 	(to be updated; currently it contains a pre-release v11)
> 
> Fengguang,
> 
> is there any chance we could start doing just the IO-less
> balance_dirty_pages, but not all the subtle other changes?  I.e. are
> there any known issues that make things work worse than current mainline
> if we only put in patches 1 to 6?

Patches 1-6 are the bare IO-less framework; the patches that follow are

1) tracing for easier debugging
2) regression fixes (e.g. under-utilized disk in small memory systems)
3) improvements

My recent focus has been trying to measure and fix the various
regressions.  Up to now the JBOD regressions have been addressed and
single disk performance also looks good.

NFS throughputs are observed to drop/rise somewhat randomly in
different cases and cannot be fixed fundamentally with the trivial
approaches I've experimented with.

3.1.0-rc4-vanilla+  3.1.0-rc4-bgthresh3+  3.1.0-rc4-nfs-smooth+
------------------  --------------------  ---------------------

           3459793   -33.2%      2310900     +2.4%      3543478  NFS-thresh=1G/nfs-10dd-1M-32p-32768M-1024M:10-X
           3371104   -32.8%      2265584    -13.9%      2902573  NFS-thresh=1G/nfs-1dd-1M-32p-32768M-1024M:10-X
           2798005   +13.4%      3171975    +21.4%      3395410  NFS-thresh=1G/nfs-2dd-1M-32p-32768M-1024M:10-X

           1641479   +13.9%      1869541    +52.7%      2506587  NFS-thresh=100M/nfs-10dd-1M-32p-32768M-100M:10-X
           3036860   -19.4%      2447633    -32.1%      2063006  NFS-thresh=100M/nfs-1dd-1M-32p-32768M-100M:10-X
           2050746   +19.8%      2456601    +28.4%      2634044  NFS-thresh=100M/nfs-2dd-1M-32p-32768M-100M:10-X

           1042855    +2.7%      1070893     +0.9%      1052112  NFS-thresh=10M/nfs-10dd-1M-32p-32768M-10M:10-X
           2106794   -41.6%      1231128    -54.6%       957305  NFS-thresh=10M/nfs-1dd-1M-32p-32768M-10M:10-X
           2034313   -40.4%      1212212    -51.7%       982609  NFS-thresh=10M/nfs-2dd-1M-32p-32768M-10M:10-X

            239379                     0    +10.2%       263894  NFS-thresh=1M/nfs-10dd-1M-32p-32768M-1M:10-X
            521149   -42.3%       300872    +13.9%       593485  NFS-thresh=1M/nfs-1dd-1M-32p-32768M-1M:10-X
            564565                     0    -49.6%       284397  NFS-thresh=1M/nfs-2dd-1M-32p-32768M-1M:10-X

> We're getting close to another merge window, and we're still busy
> trying to figure out all the details of the bandwidth estimation.  I
> think we'd have a much more robust tree if we'd first only merge the
> infrastructure (IO-less balance_dirty_pages()) and then work on the
> algorithms separately.

Agreed.  Let me sort out the minimal set of patches that can still
maintain the vanilla kernel performance, plus the tracing patches.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-29  3:32                     ` Wu Fengguang
  (?)
@ 2011-09-29  8:49                       ` Peter Zijlstra
  -1 siblings, 0 replies; 175+ messages in thread
From: Peter Zijlstra @ 2011-09-29  8:49 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, 2011-09-29 at 11:32 +0800, Wu Fengguang wrote:
> > Now I guess the only problem is when nr_bdi * MIN_WRITEBACK_PAGES ~
> > limit, at which point things go pear shaped.
> 
> Yes. In that case the global @dirty will always be drove up to @limit.
> Once @dirty dropped reasonably below, whichever bdi task wakeup first
> will take the chance to fill the gap, which is not fair for bdi's of
> different speed.
> 
> Let me retry the thresh=1M,10M test cases without MIN_WRITEBACK_PAGES.
> Hopefully the removal of it won't impact performance a lot. 


Right, so alternatively we could make the argument that this is
sufficiently rare and shouldn't happen. People with lots of disks tend
to also have lots of memory, etc.

If we do find it happening we can always look at it again.



^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-29  8:49                       ` Peter Zijlstra
@ 2011-09-29 11:05                         ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-29 11:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Sep 29, 2011 at 04:49:57PM +0800, Peter Zijlstra wrote:
> On Thu, 2011-09-29 at 11:32 +0800, Wu Fengguang wrote:
> > > Now I guess the only problem is when nr_bdi * MIN_WRITEBACK_PAGES ~
> > > limit, at which point things go pear shaped.
> > 
> > Yes. In that case the global @dirty will always be drove up to @limit.
> > Once @dirty dropped reasonably below, whichever bdi task wakeup first
> > will take the chance to fill the gap, which is not fair for bdi's of
> > different speed.
> > 
> > Let me retry the thresh=1M,10M test cases without MIN_WRITEBACK_PAGES.
> > Hopefully the removal of it won't impact performance a lot. 
> 
> 
> Right, so alternatively we could try an argument that this is
> sufficiently rare and shouldn't happen. People with lots of disks tend
> to also have lots of memory, etc.

Right.

> If we do find it happens we can always look at it again.

Sure.  Now I have the results for the single disk thresh=1M,8M,100M cases
and see no big difference when removing MIN_WRITEBACK_PAGES:

    3.1.0-rc4-bgthresh3+      3.1.0-rc4-bgthresh4+
------------------------  ------------------------
                 3988742        +1.9%      4063217  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                 4758884        +1.5%      4829320  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                 4621240        +1.6%      4693525  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                 3420717        +0.1%      3423712  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                 4361830        +1.4%      4423554  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                 3964043        +0.2%      3972057  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                 2937926        +0.6%      2956870  thresh=1M/ext4-10dd-4k-8p-4096M-1M:10-X
                 4472552        -1.9%      4387457  thresh=1M/ext4-1dd-4k-8p-4096M-1M:10-X
                 4085707        -3.0%      3961155  thresh=1M/ext4-2dd-4k-8p-4096M-1M:10-X
                 2206897        +2.1%      2253839  thresh=1M/xfs-10dd-4k-8p-4096M-1M:10-X
                 4207336        -2.1%      4119821  thresh=1M/xfs-1dd-4k-8p-4096M-1M:10-X
                 3739888        -3.6%      3604315  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
                 3279302        -0.2%      3273310  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                 4834878        +1.6%      4912372  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                 4511120        -1.7%      4435193  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                 2443874        -0.5%      2432188  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                 4308416        -0.6%      4283110  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                 3739810        +0.6%      3763320  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X

Nor when lowering the largest promotion ratio from 128 to 8:

    3.1.0-rc4-bgthresh4+      3.1.0-rc4-bgthresh5+
------------------------  ------------------------
                 4063217        -0.0%      4062022  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                 4829320        +1.1%      4882829  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                 4693525        +0.1%      4700537  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                 3423712        +0.2%      3431603  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                 4423554        -0.3%      4408912  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                 3972057        -0.1%      3968535  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                 2956870        -0.9%      2929605  thresh=1M/ext4-10dd-4k-8p-4096M-1M:10-X
                 4387457        -0.2%      4378233  thresh=1M/ext4-1dd-4k-8p-4096M-1M:10-X
                 3961155        -0.5%      3940075  thresh=1M/ext4-2dd-4k-8p-4096M-1M:10-X
                 2253839        -0.9%      2232976  thresh=1M/xfs-10dd-4k-8p-4096M-1M:10-X
                 4119821        -2.1%      4031983  thresh=1M/xfs-1dd-4k-8p-4096M-1M:10-X
                 3604315        -3.1%      3493042  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
                 3273310        -1.1%      3237060  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                 4912372        -0.0%      4911287  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                 4435193        +0.1%      4441581  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                 2432188        +1.1%      2459249  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                 4283110        +0.1%      4289456  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                 3763320        -0.1%      3758938  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X

As for the thresh=100M JBOD cases, I don't see many occurrences of a
promotion ratio > 2, so the simplification should make no difference
there either.

Thus the finalized code will be:

+       x_intercept = bdi_thresh / 2;
+       if (bdi_dirty < x_intercept) {
+               if (bdi_dirty > x_intercept / 8) {
+                       pos_ratio *= x_intercept;
+                       do_div(pos_ratio, bdi_dirty);
+               } else
+                       pos_ratio *= 8;
+       }
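
For reference, here is a minimal user-space model of how that reserve
area scaling behaves; it is only a sketch (RATELIMIT_CALC_SHIFT and the
1/2, 1/8 factors follow the snippet above, while the sample numbers and
the function name are purely illustrative):

#include <stdio.h>

#define RATELIMIT_CALC_SHIFT	10

/* Model of the reserve-area promotion applied to pos_ratio above. */
static unsigned long long scale_pos_ratio(unsigned long long pos_ratio,
					  unsigned long bdi_thresh,
					  unsigned long bdi_dirty)
{
	unsigned long x_intercept = bdi_thresh / 2;

	if (bdi_dirty < x_intercept) {
		if (bdi_dirty > x_intercept / 8)
			pos_ratio = pos_ratio * x_intercept / bdi_dirty;
		else
			pos_ratio *= 8;		/* promotion capped at 8x */
	}
	return pos_ratio;
}

int main(void)
{
	unsigned long long one = 1 << RATELIMIT_CALC_SHIFT;	/* 1.0 */
	unsigned long bdi_thresh = 25600;	/* e.g. 100MB of 4k pages */

	/* far below the reserve area: full 8x promotion (prints 8192) */
	printf("%llu\n", scale_pos_ratio(one, bdi_thresh, 1000));
	/* just below x_intercept = 12800: mild promotion (prints 1310) */
	printf("%llu\n", scale_pos_ratio(one, bdi_thresh, 10000));
	/* above x_intercept: pos_ratio left untouched (prints 1024) */
	printf("%llu\n", scale_pos_ratio(one, bdi_thresh, 20000));
	return 0;
}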

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 03/18] writeback: dirty rate control
  2011-09-04  1:53   ` Wu Fengguang
@ 2011-09-29 11:57     ` Wu Fengguang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-29 11:57 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

A minor fix to this patch.

While testing the fio mmap workload, bdi->dirty_ratelimit was observed
to be knocked down to 1 and then brought back up at regular intervals.

The problem that showed up is that it takes a long time to bring
bdi->dirty_ratelimit back up, due to the round-down in the
task_ratelimit calculation below: when dirty_ratelimit=1 and
pos_ratio=1.5, the resulting task_ratelimit will be 1, which wrongly
stops the logic from increasing dirty_ratelimit as long as
pos_ratio < 2. The change below (from round down to round up) nicely
fixes this problem.
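
To make the round-down concrete, here is a tiny stand-alone check (a
sketch only; RATELIMIT_CALC_SHIFT = 10 as in this series, and the
numbers are just the example quoted above):

#include <stdio.h>

#define RATELIMIT_CALC_SHIFT	10

int main(void)
{
	unsigned long dirty_ratelimit = 1;
	unsigned long long pos_ratio = 1536;	/* 1.5 in fixed point */
	unsigned long task_ratelimit;

	/* 1 * 1536 >> 10 rounds back down to 1: no visible increase */
	task_ratelimit = (unsigned long long)dirty_ratelimit *
					pos_ratio >> RATELIMIT_CALC_SHIFT;
	printf("rounded down: %lu\n", task_ratelimit);	/* prints 1 */

	/* the one-line fix: bump the result so the feedback can ramp up */
	task_ratelimit++;
	printf("with the fix: %lu\n", task_ratelimit);	/* prints 2 */
	return 0;
}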

Thanks,
Fengguang
---

--- linux-next.orig/mm/page-writeback.c	2011-09-24 15:52:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-24 15:52:11.000000000 +0800
@@ -766,6 +766,7 @@ static void bdi_update_dirty_ratelimit(s
 	 */
 	task_ratelimit = (u64)dirty_ratelimit *
 					pos_ratio >> RATELIMIT_CALC_SHIFT;
+	task_ratelimit++; /* it helps rampup dirty_ratelimit from tiny values */
 
 	/*
 	 * A linear estimation of the "balanced" throttle rate. The theory is,

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 10/18] writeback: dirty position control - bdi reserve area
  2011-09-28 14:02               ` Wu Fengguang
  2011-09-28 14:50                   ` Peter Zijlstra
@ 2011-09-29 12:15                 ` Wu Fengguang
  1 sibling, 0 replies; 175+ messages in thread
From: Wu Fengguang @ 2011-09-29 12:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 6644 bytes --]

On Wed, Sep 28, 2011 at 10:02:05PM +0800, Wu Fengguang wrote:
> Hi Peter,
> 
> On Sun, Sep 18, 2011 at 10:47:51PM +0800, Wu Fengguang wrote:
> > > BTW, I also compared the IO-less patchset and the vanilla kernel's
> > > JBOD performance. Basically, the performance is lightly improved
> > > under large memory, and reduced a lot in small memory servers.
> > > 
> > >  vanilla  IO-less
> > > --------------------------------------------------------------------------------
> > [...]
> > >  26508063 17706200      -33.2%  JBOD-10HDD-thresh=100M/xfs-100dd-1M-16p-5895M-100M
> > >  23767810 23374918       -1.7%  JBOD-10HDD-thresh=100M/xfs-10dd-1M-16p-5895M-100M
> > >  28032891 20659278      -26.3%  JBOD-10HDD-thresh=100M/xfs-1dd-1M-16p-5895M-100M
> > >  26049973 22517497      -13.6%  JBOD-10HDD-thresh=100M/xfs-2dd-1M-16p-5895M-100M
> > > 
> > > There are still some itches in JBOD..
> > 
> > OK, in the dirty_bytes=100M case, I find that the bdi threshold _and_
> > writeout bandwidth may drop close to 0 in long periods. This change
> > may avoid one bdi being stuck:
> > 
> >         /*
> >          * bdi reserve area, safeguard against dirty pool underrun and disk idle
> >          *
> >          * It may push the desired control point of global dirty pages higher
> >          * than setpoint. It's not necessary in single-bdi case because a
> >          * minimal pool of @freerun dirty pages will already be guaranteed.
> >          */
> > -       x_intercept = min(write_bw, freerun);
> > +       x_intercept = min(write_bw + MIN_WRITEBACK_PAGES, freerun);
> 
> After lots of experiments, I end up with this bdi reserve point
> 
> +       x_intercept = bdi_thresh / 2 + MIN_WRITEBACK_PAGES;
> 
> together with this chunk to avoid a bdi stuck in bdi_thresh=0 state:
> 
> @@ -590,6 +590,7 @@ static unsigned long bdi_position_ratio(
>          */
>         if (unlikely(bdi_thresh > thresh))
>                 bdi_thresh = thresh;
> +       bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
>         /*
>          * scale global setpoint to bdi's:
>          *      bdi_setpoint = setpoint * bdi_thresh / thresh
> 
> The above changes are good enough to keep reasonable amount of bdi
> dirty pages, so the bdi underrun flag ("[PATCH 11/18] block: add bdi
> flag to indicate risk of io queue underrun") is dropped.
> 
> I also tried various bdi freerun patches, however the results are not
> satisfactory. Basically the bdi reserve area approach (this patch)
> yields noticeably more smooth/resilient behavior than the
> freerun/underrun approaches. I noticed that the bdi underrun flag
> could lead to sudden surge of dirty pages (especially if not
> safeguarded by the dirty_exceeded condition) in the very small
> window..
> 
> To dig performance increases/drops out of the large number of test
> results, I wrote a convenient script (attached) to compare the
> vmstat:nr_written numbers between 2+ set of test runs. It helped a lot
> for fine tuning the parameters for different cases.
> 
> The current JBOD performance numbers are encouraging:
> 
> $ ./compare.rb JBOD*/*-vanilla+ JBOD*/*-bgthresh3+
>       3.1.0-rc4-vanilla+      3.1.0-rc4-bgthresh3+
> ------------------------  ------------------------
>                 52934365        +3.2%     54643527  JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X
>                 45488896       +18.2%     53785605  JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X
>                 47217534       +12.2%     53001031  JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X
>                 32286924       +25.4%     40492312  JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
>                 38676965       +14.2%     44177606  JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
>                 59662173       +11.1%     66269621  JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X
>                 57510438        +2.3%     58855181  JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X
>                 63691922       +64.0%    104460352  JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
>                 51978567       +16.0%     60298210  JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
>                 47641062        +6.4%     50681038  JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X
[snip]

I forgot to mention one important change that led to the increased
JBOD performance: the per-bdi background threshold shown in the patch
below.

One thing that puzzled me is that in the JBOD case, the per-disk
writeout performance is lower than in the corresponding single-disk
case even when they have comparable bdi_thresh. So I wrote the
attached tracing patch and found that in the single-disk case,
bdi_writeback is always kept high, while in the JBOD case it can drop
low from time to time, with bdi_reclaimable correspondingly shooting
up on occasion.

The fix is to watch bdi_reclaimable and kick background writeback as
soon as it goes high. This resembles the global background threshold,
but in a per-bdi manner. The trick is, as long as bdi_reclaimable does
not go high, bdi_writeback naturally won't go low, because
bdi_reclaimable + bdi_writeback ~= bdi_thresh. With enough writeback
pages, good performance is maintained.

Thanks,
Fengguang
---

--- linux-next.orig/fs/fs-writeback.c	2011-09-25 10:08:43.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-09-25 15:36:41.000000000 +0800
@@ -678,14 +678,18 @@ long writeback_inodes_wb(struct bdi_writ
 	return nr_pages - work.nr_pages;
 }
 
-static inline bool over_bground_thresh(void)
+static bool over_bground_thresh(struct backing_dev_info *bdi)
 {
 	unsigned long background_thresh, dirty_thresh;
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 
-	return (global_page_state(NR_FILE_DIRTY) +
-		global_page_state(NR_UNSTABLE_NFS) > background_thresh);
+	if (global_page_state(NR_FILE_DIRTY) +
+	    global_page_state(NR_UNSTABLE_NFS) > background_thresh)
+		return true;
+
+	return bdi_stat(bdi, BDI_RECLAIMABLE) >
+				bdi_dirty_limit(bdi, background_thresh);
 }
 
 /*
@@ -747,7 +751,7 @@ static long wb_writeback(struct bdi_writ
 		 * For background writeout, stop when we are below the
 		 * background dirty threshold
 		 */
-		if (work->for_background && !over_bground_thresh())
+		if (work->for_background && !over_bground_thresh(wb->bdi))
 			break;
 
 		if (work->for_kupdate) {
@@ -831,7 +835,7 @@ static unsigned long get_nr_dirty_pages(
 
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
-	if (over_bground_thresh()) {
+	if (over_bground_thresh(wb->bdi)) {
 
 		struct wb_writeback_work work = {
 			.nr_pages	= LONG_MAX,

[-- Attachment #2: trace-bdi-dirty-state.patch --]
[-- Type: text/x-diff, Size: 2122 bytes --]

Subject: 
Date: Thu Sep 01 09:56:44 CST 2011


Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   41 ++++++++++++++++++++++++++++-
 mm/page-writeback.c              |    2 +
 2 files changed, 42 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-09-01 10:09:58.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-09-01 10:13:38.000000000 +0800
@@ -1104,6 +1104,8 @@ static void balance_dirty_pages(struct a
 			bdi_dirty = bdi_reclaimable +
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
+		trace_bdi_dirty_state(bdi, bdi_thresh,
+				      bdi_dirty, bdi_reclaimable);
 
 		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
--- linux-next.orig/include/trace/events/writeback.h	2011-09-01 10:09:58.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-09-01 10:12:54.000000000 +0800
@@ -265,6 +264,46 @@ TRACE_EVENT(global_dirty_state,
 	)
 );
 
+TRACE_EVENT(bdi_dirty_state,
+
+	TP_PROTO(struct backing_dev_info *bdi,
+		 unsigned long bdi_thresh,
+		 unsigned long bdi_dirty,
+		 unsigned long bdi_reclaimable
+	),
+
+	TP_ARGS(bdi, bdi_thresh, bdi_dirty, bdi_reclaimable),
+
+	TP_STRUCT__entry(
+		__array(char,		bdi, 32)
+		__field(unsigned long,	bdi_reclaimable)
+		__field(unsigned long,	bdi_writeback)
+		__field(unsigned long,	bdi_thresh)
+		__field(unsigned long,	bdi_dirtied)
+		__field(unsigned long,	bdi_written)
+	),
+
+	TP_fast_assign(
+		strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
+		__entry->bdi_reclaimable	= bdi_reclaimable;
+		__entry->bdi_writeback		= bdi_dirty - bdi_reclaimable;
+		__entry->bdi_thresh		= bdi_thresh;
+		__entry->bdi_dirtied		= bdi_stat(bdi, BDI_DIRTIED);
+		__entry->bdi_written		= bdi_stat(bdi, BDI_WRITTEN);
+	),
+
+	TP_printk("bdi %s: reclaimable=%lu writeback=%lu "
+		  "thresh=%lu "
+		  "dirtied=%lu written=%lu",
+		  __entry->bdi,
+		  __entry->bdi_reclaimable,
+		  __entry->bdi_writeback,
+		  __entry->bdi_thresh,
+		  __entry->bdi_dirtied,
+		  __entry->bdi_written
+	)
+);
+
 #define KBps(x)			((x) << (PAGE_SHIFT - 10))
 
 TRACE_EVENT(dirty_ratelimit,

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 02/18] writeback: dirty position control
  2011-09-04  1:53   ` Wu Fengguang
@ 2011-11-12  5:44     ` Nai Xia
  -1 siblings, 0 replies; 175+ messages in thread
From: Nai Xia @ 2011-11-12  5:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Jan Kara, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

Hello Fengguang,

Is this similar to the idea and algorithm behind TCP congestion
control since 2.6.19?

Same situation: multiple TCP connections contending for network
bandwidth vs. multiple processes contending for BDI bandwidth.

Same solution: per-connection (vs. per-process) speed control with
cubic-curve-controlled balancing.

:-)

Then the validity and efficiency of the approach have in essence been
verified in the real world for years in a similar situation. Good to
see we are going to have it in writeback too!


Thanks,
Nai


On Sun, Sep 4, 2011 at 9:53 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulted task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
>
> Old scheme is,
>                                          |
>                           free run area  |  throttle area
>  ----------------------------------------+---------------------------->
>                                    thresh^                  dirty pages
>
> New scheme is,
>
>  ^ task rate limit
>  |
>  |            *
>  |             *
>  |              *
>  |[free run]      *      [smooth throttled]
>  |                  *
>  |                     *
>  |                         *
>  ..bdi->dirty_ratelimit..........*
>  |                               .     *
>  |                               .          *
>  |                               .              *
>  |                               .                 *
>  |                               .                    *
>  +-------------------------------.-----------------------*------------>
>                          setpoint^                  limit^  dirty pages
>
> The slope of the bdi control line should be
>
> 1) large enough to pull the dirty pages to setpoint reasonably fast
>
> 2) small enough to avoid big fluctuations in the resulted pos_ratio and
>   hence task ratelimit
>
> Since the fluctuation range of the bdi dirty pages is typically observed
> to be within 1-second worth of data, the bdi control line's slope is
> selected to be a linear function of bdi write bandwidth, so that it can
> adapt to slow/fast storage devices well.
>
> Assume the bdi control line
>
>        pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
>
> where k is the negative slope.
>
> If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
> are fluctuating in range
>
>        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
>
> we get slope
>
>        k = - 1 / (8 * write_bw)
>
> Let pos_ratio(x_intercept) = 0, we get the parameter used in code:
>
>        x_intercept = bdi_setpoint + 8 * write_bw
>
> The global/bdi slopes are nicely complementing each other when the
> system has only one major bdi (indicated by bdi_thresh ~= thresh):
>
> 1) slope of global control line    => scaling to the control scope size
> 2) slope of main bdi control line  => scaling to the writeout bandwidth
>
> so that
>
> - in memory tight systems, (1) becomes strong enough to squeeze dirty
>  pages inside the control scope
>
> - in large memory systems where the "gravity" of (1) for pulling the
>  dirty pages to setpoint is too weak, (2) can back (1) up and drive
>  dirty pages to bdi_setpoint ~= setpoint reasonably fast.
>
> Unfortunately in JBOD setups, the fluctuation range of bdi threshold
> is related to memory size due to the interferences between disks.  In
> this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.
>
> Given equations
>
>        span = x_intercept - bdi_setpoint
>        k = df/dx = - 1 / span
>
> and the extremum values
>
>        span = bdi_thresh
>        dx = bdi_thresh
>
> we get
>
>        df = - dx / span = - 1.0
>
> That means, when bdi_dirty deviates bdi_thresh up, pos_ratio and hence
> task ratelimit will fluctuate by -100%.
>
> peter: use 3rd order polynomial for the global control line
>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Acked-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c         |    2
>  include/linux/writeback.h |    1
>  mm/page-writeback.c       |  213 +++++++++++++++++++++++++++++++++++-
>  3 files changed, 210 insertions(+), 6 deletions(-)
>
> --- linux-next.orig/mm/page-writeback.c 2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/mm/page-writeback.c      2011-08-26 15:57:34.000000000 +0800
> @@ -46,6 +46,8 @@
>  */
>  #define BANDWIDTH_INTERVAL     max(HZ/5, 1)
>
> +#define RATELIMIT_CALC_SHIFT   10
> +
>  /*
>  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>  * will look to see if it needs to force writeback or throttling.
> @@ -409,6 +411,12 @@ int bdi_set_max_ratio(struct backing_dev
>  }
>  EXPORT_SYMBOL(bdi_set_max_ratio);
>
> +static unsigned long dirty_freerun_ceiling(unsigned long thresh,
> +                                          unsigned long bg_thresh)
> +{
> +       return (thresh + bg_thresh) / 2;
> +}
> +
>  static unsigned long hard_dirty_limit(unsigned long thresh)
>  {
>        return max(thresh, global_dirty_limit);
> @@ -493,6 +501,197 @@ unsigned long bdi_dirty_limit(struct bac
>        return bdi_dirty;
>  }
>
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0              bdi_setpoint                    x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +                                       unsigned long thresh,
> +                                       unsigned long bg_thresh,
> +                                       unsigned long dirty,
> +                                       unsigned long bdi_thresh,
> +                                       unsigned long bdi_dirty)
> +{
> +       unsigned long write_bw = bdi->avg_write_bandwidth;
> +       unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +       unsigned long limit = hard_dirty_limit(thresh);
> +       unsigned long x_intercept;
> +       unsigned long setpoint;         /* dirty pages' target balance point */
> +       unsigned long bdi_setpoint;
> +       unsigned long span;
> +       long long pos_ratio;            /* for scaling up/down the rate limit */
> +       long x;
> +
> +       if (unlikely(dirty >= limit))
> +               return 0;
> +
> +       /*
> +        * global setpoint
> +        *
> +        *                           setpoint - dirty 3
> +        *        f(dirty) := 1.0 + (----------------)
> +        *                           limit - setpoint
> +        *
> +        * it's a 3rd order polynomial that is subject to
> +        *
> +        * (1) f(freerun)  = 2.0 => ramp up dirty_ratelimit reasonably fast
> +        * (2) f(setpoint) = 1.0 => the balance point
> +        * (3) f(limit)    = 0   => the hard limit
> +        * (4) df/dx      <= 0   => negative feedback control
> +        * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +        *     => fast response on large errors; small oscillation near setpoint
> +        */
> +       setpoint = (freerun + limit) / 2;
> +       x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +                   limit - setpoint + 1);
> +       pos_ratio = x;
> +       pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +       pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +       pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +       /*
> +        * We have computed basic pos_ratio above based on global situation. If
> +        * the bdi is over/under its share of dirty pages, we want to scale
> +        * pos_ratio further down/up. That is done by the following mechanism.
> +        */
> +
> +       /*
> +        * bdi setpoint
> +        *
> +        *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> +        *
> +        *                        x_intercept - bdi_dirty
> +        *                     := --------------------------
> +        *                        x_intercept - bdi_setpoint
> +        *
> +        * The main bdi control line is a linear function that subjects to
> +        *
> +        * (1) f(bdi_setpoint) = 1.0
> +        * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +        *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
> +        *
> +        * For single bdi case, the dirty pages are observed to fluctuate
> +        * regularly within range
> +        *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> +        * for various filesystems, where (2) yields a reasonable 12.5%
> +        * fluctuation range for pos_ratio.
> +        *
> +        * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +        * own size, so move the slope over accordingly and choose a slope that
> +        * yields 100% pos_ratio fluctuation on suddenly doubled bdi_thresh.
> +        */
> +       if (unlikely(bdi_thresh > thresh))
> +               bdi_thresh = thresh;
> +       /*
> +        * scale global setpoint to bdi's:
> +        *      bdi_setpoint = setpoint * bdi_thresh / thresh
> +        */
> +       x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> +       bdi_setpoint = setpoint * (u64)x >> 16;
> +       /*
> +        * Use span=(8*write_bw) in single bdi case as indicated by
> +        * (thresh - bdi_thresh ~= 0) and transition to bdi_thresh in the JBOD case.
> +        *
> +        *        bdi_thresh                    thresh - bdi_thresh
> +        * span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh
> +        *          thresh                            thresh
> +        */
> +       span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16;
> +       x_intercept = bdi_setpoint + span;
> +
> +       span >>= 1;
> +       if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> +               if (unlikely(bdi_dirty > limit))
> +                       return 0;
> +               if (x_intercept < limit) {
> +                       x_intercept = limit;    /* auxiliary control line */
> +                       bdi_setpoint += span;
> +                       pos_ratio >>= 1;
> +               }
> +       }
> +       pos_ratio *= x_intercept - bdi_dirty;
> +       do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> +       return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>                                       unsigned long elapsed,
>                                       unsigned long written)
> @@ -591,6 +790,7 @@ static void global_update_bandwidth(unsi
>
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>                            unsigned long thresh,
> +                           unsigned long bg_thresh,
>                            unsigned long dirty,
>                            unsigned long bdi_thresh,
>                            unsigned long bdi_dirty,
> @@ -627,6 +827,7 @@ snapshot:
>
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>                                 unsigned long thresh,
> +                                unsigned long bg_thresh,
>                                 unsigned long dirty,
>                                 unsigned long bdi_thresh,
>                                 unsigned long bdi_dirty,
> @@ -635,8 +836,8 @@ static void bdi_update_bandwidth(struct
>        if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>                return;
>        spin_lock(&bdi->wb.list_lock);
> -       __bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> -                              start_time);
> +       __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> +                              bdi_thresh, bdi_dirty, start_time);
>        spin_unlock(&bdi->wb.list_lock);
>  }
>
> @@ -677,7 +878,8 @@ static void balance_dirty_pages(struct a
>                 * catch-up. This avoids (excessively) small writeouts
>                 * when the bdi limits are ramping up.
>                 */
> -               if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> +               if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> +                                                     background_thresh))
>                        break;
>
>                bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -721,8 +923,9 @@ static void balance_dirty_pages(struct a
>                if (!bdi->dirty_exceeded)
>                        bdi->dirty_exceeded = 1;
>
> -               bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> -                                    bdi_thresh, bdi_dirty, start_time);
> +               bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> +                                    nr_dirty, bdi_thresh, bdi_dirty,
> +                                    start_time);
>
>                /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>                 * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c   2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/fs/fs-writeback.c        2011-08-26 15:57:20.000000000 +0800
> @@ -675,7 +675,7 @@ static inline bool over_bground_thresh(v
>  static void wb_update_bandwidth(struct bdi_writeback *wb,
>                                unsigned long start_time)
>  {
> -       __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> +       __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
>  }
>
>  /*
> --- linux-next.orig/include/linux/writeback.h   2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/include/linux/writeback.h        2011-08-26 15:57:20.000000000 +0800
> @@ -141,6 +141,7 @@ unsigned long bdi_dirty_limit(struct bac
>
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>                            unsigned long thresh,
> +                           unsigned long bg_thresh,
>                            unsigned long dirty,
>                            unsigned long bdi_thresh,
>                            unsigned long bdi_dirty,
>
>
>
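To make the fixed-point arithmetic in the diff above easier to follow, here is a
small user-space sketch that evaluates the same cubic global control line and
checks the three anchor points from the comment block, f(freerun) ~= 2.0,
f(setpoint) = 1.0 and f(limit) ~= 0. It is only an illustration: the page counts
are made-up example values and the helper is not the kernel function itself.

#include <stdio.h>

#define RATELIMIT_CALC_SHIFT	10

/* same fixed-point form as the global setpoint code in the patch */
static long long global_pos_ratio(unsigned long freerun, unsigned long limit,
				  unsigned long dirty)
{
	unsigned long setpoint = (freerun + limit) / 2;
	long long x, pos_ratio;

	/* kernel shifts here; multiply instead to keep plain C well-defined */
	x = (((long long)setpoint - (long long)dirty) * (1 << RATELIMIT_CALC_SHIFT)) /
	    (long long)(limit - setpoint + 1);
	pos_ratio = x;
	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
	return pos_ratio + (1 << RATELIMIT_CALC_SHIFT);
}

int main(void)
{
	unsigned long freerun = 30000, limit = 50000;	/* example page counts */
	unsigned long setpoint = (freerun + limit) / 2;

	/* expect roughly 2048 (2.0), 1024 (1.0) and ~0, in units of 1/1024 */
	printf("f(freerun)  = %lld\n", global_pos_ratio(freerun, limit, freerun));
	printf("f(setpoint) = %lld\n", global_pos_ratio(freerun, limit, setpoint));
	printf("f(limit)    = %lld\n", global_pos_ratio(freerun, limit, limit));
	return 0;
}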

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH 02/18] writeback: dirty position control
@ 2011-11-12  5:44     ` Nai Xia
  0 siblings, 0 replies; 175+ messages in thread
From: Nai Xia @ 2011-11-12  5:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Jan Kara, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

Hello Fengguang,

Is this similar to the idea and algorithm behind TCP congestion control
since 2.6.19?

Same situation: multiple TCP connections contending for network
bandwidth vs. multiple processes contending for BDI bandwidth.

Same solution: per-connection (vs. per-process) speed control, with
balancing governed by a cubic control curve.

:-)

So the validity and efficiency of the approach have essentially been
verified in the real world for years in a similar situation. Good to see
we are going to have it in writeback too!


Thanks,
Nai


On Sun, Sep 4, 2011 at 9:53 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulting task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
>
> Old scheme is,
>                                          |
>                           free run area  |  throttle area
>  ----------------------------------------+---------------------------->
>                                    thresh^                  dirty pages
>
> New scheme is,
>
>  ^ task rate limit
>  |
>  |            *
>  |             *
>  |              *
>  |[free run]      *      [smooth throttled]
>  |                  *
>  |                     *
>  |                         *
>  ..bdi->dirty_ratelimit..........*
>  |                               .     *
>  |                               .          *
>  |                               .              *
>  |                               .                 *
>  |                               .                    *
>  +-------------------------------.-----------------------*------------>
>                          setpoint^                  limit^  dirty pages
>
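For a concrete feel of where the free run area ends (example numbers, not from
the patch): with, say, dirty_background_ratio = 10 and dirty_ratio = 20, the
boundary sits at

	dirty_freerun_ceiling = (thresh + bg_thresh) / 2 = (20% + 10%) / 2 = 15%

of dirtyable memory; below that point balance_dirty_pages() breaks out without
throttling, and the smooth throttling region of the new scheme starts above it.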
> The slope of the bdi control line should be
>
> 1) large enough to pull the dirty pages to setpoint reasonably fast
>
> 2) small enough to avoid big fluctuations in the resulting pos_ratio and
>   hence the task ratelimit
>
> Since the fluctuation range of the bdi dirty pages is typically observed
> to be within one second's worth of written data, the bdi control line's
> slope is selected to be a linear function of the bdi write bandwidth, so
> that it adapts well to both slow and fast storage devices.
>
> Assume the bdi control line
>
>        pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
>
> where k is the negative slope.
>
> If we target a 12.5% fluctuation range for pos_ratio while the dirty pages
> fluctuate within the range
>
>        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
>
> we get slope
>
>        k = - 1 / (8 * write_bw)
>
> Setting pos_ratio(x_intercept) = 0, we get the parameter used in the code:
>
>        x_intercept = bdi_setpoint + 8 * write_bw
>
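Working out the 12.5% figure explicitly (this is just the arithmetic implied
above):

	pos_ratio(bdi_setpoint + write_bw/2) = 1 - (write_bw/2) / (8 * write_bw) = 1 - 1/16
	pos_ratio(bdi_setpoint - write_bw/2) = 1 + 1/16

so pos_ratio stays within [0.9375, 1.0625], a total fluctuation range of
1/8 = 12.5%.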
> The global/bdi slopes nicely complement each other when the system has
> only one major bdi (indicated by bdi_thresh ~= thresh):
>
> 1) slope of global control line    => scaling to the control scope size
> 2) slope of main bdi control line  => scaling to the writeout bandwidth
>
> so that
>
> - in memory-tight systems, (1) becomes strong enough to squeeze the dirty
>  pages inside the control scope
>
> - in large-memory systems, where the "gravity" of (1) pulling the dirty
>  pages toward the setpoint is too weak, (2) backs (1) up and drives the
>  dirty pages to bdi_setpoint ~= setpoint reasonably fast.
>
> Unfortunately, in JBOD setups the fluctuation range of the bdi threshold
> scales with memory size due to the interference between disks.  In this
> case, the bdi slope is a weighted sum of write_bw and bdi_thresh.
>
> Given equations
>
>        span = x_intercept - bdi_setpoint
>        k = df/dx = - 1 / span
>
> and the extremum values
>
>        span = bdi_thresh
>        dx = bdi_thresh
>
> we get
>
>        df = - dx / span = - 1.0
>
> That means, when bdi_dirty deviates upward by a full bdi_thresh, pos_ratio
> and hence the task ratelimit will swing by -100%.
>
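As a quick check of that extremum: with span = bdi_thresh the control line
reaches zero exactly one bdi_thresh above the setpoint,

	pos_ratio(bdi_setpoint + bdi_thresh) = 1 - bdi_thresh / span = 0

so a bdi_thresh-sized excursion of bdi_dirty takes pos_ratio from 1.0 all the
way down to 0, i.e. the -100% swing mentioned above.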
> peter: use 3rd order polynomial for the global control line
>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Acked-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c         |    2
>  include/linux/writeback.h |    1
>  mm/page-writeback.c       |  213 +++++++++++++++++++++++++++++++++++-
>  3 files changed, 210 insertions(+), 6 deletions(-)
>
> --- linux-next.orig/mm/page-writeback.c 2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/mm/page-writeback.c      2011-08-26 15:57:34.000000000 +0800
> @@ -46,6 +46,8 @@
>  */
>  #define BANDWIDTH_INTERVAL     max(HZ/5, 1)
>
> +#define RATELIMIT_CALC_SHIFT   10
> +
>  /*
>  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>  * will look to see if it needs to force writeback or throttling.
> @@ -409,6 +411,12 @@ int bdi_set_max_ratio(struct backing_dev
>  }
>  EXPORT_SYMBOL(bdi_set_max_ratio);
>
> +static unsigned long dirty_freerun_ceiling(unsigned long thresh,
> +                                          unsigned long bg_thresh)
> +{
> +       return (thresh + bg_thresh) / 2;
> +}
> +
>  static unsigned long hard_dirty_limit(unsigned long thresh)
>  {
>        return max(thresh, global_dirty_limit);
> @@ -493,6 +501,197 @@ unsigned long bdi_dirty_limit(struct bac
>        return bdi_dirty;
>  }
>
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages to be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0              bdi_setpoint                    x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +                                       unsigned long thresh,
> +                                       unsigned long bg_thresh,
> +                                       unsigned long dirty,
> +                                       unsigned long bdi_thresh,
> +                                       unsigned long bdi_dirty)
> +{
> +       unsigned long write_bw = bdi->avg_write_bandwidth;
> +       unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +       unsigned long limit = hard_dirty_limit(thresh);
> +       unsigned long x_intercept;
> +       unsigned long setpoint;         /* dirty pages' target balance point */
> +       unsigned long bdi_setpoint;
> +       unsigned long span;
> +       long long pos_ratio;            /* for scaling up/down the rate limit */
> +       long x;
> +
> +       if (unlikely(dirty >= limit))
> +               return 0;
> +
> +       /*
> +        * global setpoint
> +        *
> +        *                           setpoint - dirty 3
> +        *        f(dirty) := 1.0 + (----------------)
> +        *                           limit - setpoint
> +        *
> +        * it's a 3rd order polynomial that is subject to
> +        *
> +        * (1) f(freerun)  = 2.0 => ramp up dirty_ratelimit reasonably fast
> +        * (2) f(setpoint) = 1.0 => the balance point
> +        * (3) f(limit)    = 0   => the hard limit
> +        * (4) df/dx      <= 0   => negative feedback control
> +        * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +        *     => fast response on large errors; small oscillation near setpoint
> +        */
> +       setpoint = (freerun + limit) / 2;
> +       x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +                   limit - setpoint + 1);
> +       pos_ratio = x;
> +       pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +       pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +       pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +       /*
> +        * We have computed basic pos_ratio above based on global situation. If
> +        * the bdi is over/under its share of dirty pages, we want to scale
> +        * pos_ratio further down/up. That is done by the following mechanism.
> +        */
> +
> +       /*
> +        * bdi setpoint
> +        *
> +        *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> +        *
> +        *                        x_intercept - bdi_dirty
> +        *                     := --------------------------
> +        *                        x_intercept - bdi_setpoint
> +        *
> +        * The main bdi control line is a linear function that subjects to
> +        *
> +        * (1) f(bdi_setpoint) = 1.0
> +        * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +        *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
> +        *
> +        * For single bdi case, the dirty pages are observed to fluctuate
> +        * regularly within range
> +        *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> +        * for various filesystems, where (2) yields a reasonable 12.5%
> +        * fluctuation range for pos_ratio.
> +        *
> +        * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +        * own size, so move the slope over accordingly and choose a slope that
> +        * yields 100% pos_ratio fluctuation on suddenly doubled bdi_thresh.
> +        */
> +       if (unlikely(bdi_thresh > thresh))
> +               bdi_thresh = thresh;
> +       /*
> +        * scale global setpoint to bdi's:
> +        *      bdi_setpoint = setpoint * bdi_thresh / thresh
> +        */
> +       x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> +       bdi_setpoint = setpoint * (u64)x >> 16;
> +       /*
> +        * Use span=(8*write_bw) in single bdi case as indicated by
> +        * (thresh - bdi_thresh ~= 0) and transition to bdi_thresh in the JBOD case.
> +        *
> +        *        bdi_thresh                    thresh - bdi_thresh
> +        * span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh
> +        *          thresh                            thresh
> +        */
> +       span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16;
> +       x_intercept = bdi_setpoint + span;
> +
> +       span >>= 1;
> +       if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> +               if (unlikely(bdi_dirty > limit))
> +                       return 0;
> +               if (x_intercept < limit) {
> +                       x_intercept = limit;    /* auxiliary control line */
> +                       bdi_setpoint += span;
> +                       pos_ratio >>= 1;
> +               }
> +       }
> +       pos_ratio *= x_intercept - bdi_dirty;
> +       do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> +       return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>                                       unsigned long elapsed,
>                                       unsigned long written)
> @@ -591,6 +790,7 @@ static void global_update_bandwidth(unsi
>
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>                            unsigned long thresh,
> +                           unsigned long bg_thresh,
>                            unsigned long dirty,
>                            unsigned long bdi_thresh,
>                            unsigned long bdi_dirty,
> @@ -627,6 +827,7 @@ snapshot:
>
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>                                 unsigned long thresh,
> +                                unsigned long bg_thresh,
>                                 unsigned long dirty,
>                                 unsigned long bdi_thresh,
>                                 unsigned long bdi_dirty,
> @@ -635,8 +836,8 @@ static void bdi_update_bandwidth(struct
>        if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>                return;
>        spin_lock(&bdi->wb.list_lock);
> -       __bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> -                              start_time);
> +       __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> +                              bdi_thresh, bdi_dirty, start_time);
>        spin_unlock(&bdi->wb.list_lock);
>  }
>
> @@ -677,7 +878,8 @@ static void balance_dirty_pages(struct a
>                 * catch-up. This avoids (excessively) small writeouts
>                 * when the bdi limits are ramping up.
>                 */
> -               if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> +               if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> +                                                     background_thresh))
>                        break;
>
>                bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -721,8 +923,9 @@ static void balance_dirty_pages(struct a
>                if (!bdi->dirty_exceeded)
>                        bdi->dirty_exceeded = 1;
>
> -               bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> -                                    bdi_thresh, bdi_dirty, start_time);
> +               bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> +                                    nr_dirty, bdi_thresh, bdi_dirty,
> +                                    start_time);
>
>                /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>                 * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c   2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/fs/fs-writeback.c        2011-08-26 15:57:20.000000000 +0800
> @@ -675,7 +675,7 @@ static inline bool over_bground_thresh(v
>  static void wb_update_bandwidth(struct bdi_writeback *wb,
>                                unsigned long start_time)
>  {
> -       __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> +       __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
>  }
>
>  /*
> --- linux-next.orig/include/linux/writeback.h   2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/include/linux/writeback.h        2011-08-26 15:57:20.000000000 +0800
> @@ -141,6 +141,7 @@ unsigned long bdi_dirty_limit(struct bac
>
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>                            unsigned long thresh,
> +                           unsigned long bg_thresh,
>                            unsigned long dirty,
>                            unsigned long bdi_thresh,
>                            unsigned long bdi_dirty,
>
>
>
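For completeness, the sketch below re-implements the single-bdi case of the bdi
control line in user space (span = 8 * write_bw, plus the switch to the
auxiliary line half a span past the setpoint) and shows how the resulting
pos_ratio would scale a task's ratelimit. All numbers are made-up examples and
the code only illustrates the patch's logic; it is not the kernel function.

#include <stdio.h>

#define RATELIMIT_CALC_SHIFT	10

/* main + auxiliary bdi control line, single-bdi simplification */
static long long bdi_factor(unsigned long bdi_setpoint, unsigned long write_bw,
			    unsigned long limit, unsigned long bdi_dirty,
			    long long pos_ratio)
{
	unsigned long span = 8 * write_bw;
	unsigned long x_intercept = bdi_setpoint + span;

	span >>= 1;
	if (bdi_dirty > bdi_setpoint + span) {
		if (bdi_dirty > limit)
			return 0;
		if (x_intercept < limit) {
			x_intercept = limit;	/* auxiliary control line */
			bdi_setpoint += span;
			pos_ratio >>= 1;	/* connect point: rate scale 1/2 */
		}
	}
	pos_ratio *= x_intercept - bdi_dirty;
	pos_ratio /= x_intercept - bdi_setpoint + 1;
	return pos_ratio;
}

int main(void)
{
	/* made-up example: ~100MB/s write bandwidth with 4k pages */
	unsigned long write_bw = 25600;			/* pages/s */
	unsigned long bdi_setpoint = 200000, limit = 600000;
	unsigned long dirty_ratelimit = 25600;		/* base rate, pages/s */
	long long global = 1 << RATELIMIT_CALC_SHIFT;	/* assume global factor 1.0 */
	unsigned long bdi_dirty;

	for (bdi_dirty = 180000; bdi_dirty <= 260000; bdi_dirty += 20000) {
		long long p = bdi_factor(bdi_setpoint, write_bw, limit,
					 bdi_dirty, global);
		/* task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT */
		printf("bdi_dirty=%lu pos_ratio=%lld/1024 task_ratelimit=%lld pages/s\n",
		       bdi_dirty, p,
		       (long long)dirty_ratelimit * p >> RATELIMIT_CALC_SHIFT);
	}
	return 0;
}

Running it shows the negative feedback at work: pos_ratio (and hence the task
ratelimit) rises above 1.0 when bdi_dirty sits below bdi_setpoint and falls
below 1.0 as bdi_dirty climbs toward x_intercept.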


^ permalink raw reply	[flat|nested] 175+ messages in thread

end of thread, other threads:[~2011-11-12  5:44 UTC | newest]

Thread overview: 175+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-04  1:53 [PATCH 00/18] IO-less dirty throttling v11 Wu Fengguang
2011-09-04  1:53 ` Wu Fengguang
2011-09-04  1:53 ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 01/18] writeback: account per-bdi accumulated dirtied pages Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 02/18] writeback: dirty position control Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-05 15:02   ` Peter Zijlstra
2011-09-05 15:02     ` Peter Zijlstra
2011-09-06  2:10     ` Wu Fengguang
2011-09-06  2:10       ` Wu Fengguang
2011-09-05 15:05   ` Peter Zijlstra
2011-09-05 15:05     ` Peter Zijlstra
2011-09-06  2:43     ` Wu Fengguang
2011-09-06  2:43       ` Wu Fengguang
2011-09-06 18:20   ` Vivek Goyal
2011-09-06 18:20     ` Vivek Goyal
2011-09-08  2:53     ` Wu Fengguang
2011-09-08  2:53       ` Wu Fengguang
2011-11-12  5:44   ` Nai Xia
2011-11-12  5:44     ` Nai Xia
2011-09-04  1:53 ` [PATCH 03/18] writeback: dirty rate control Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-29 11:57   ` Wu Fengguang
2011-09-29 11:57     ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 04/18] writeback: stabilize bdi->dirty_ratelimit Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 05/18] writeback: per task dirty rate limit Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-06 15:47   ` Peter Zijlstra
2011-09-06 15:47     ` Peter Zijlstra
2011-09-06 15:47     ` Peter Zijlstra
2011-09-06 23:27     ` Jan Kara
2011-09-06 23:27       ` Jan Kara
2011-09-06 23:34       ` Jan Kara
2011-09-06 23:34         ` Jan Kara
2011-09-07  7:27       ` Peter Zijlstra
2011-09-07  7:27         ` Peter Zijlstra
2011-09-07  7:27         ` Peter Zijlstra
2011-09-07  1:04     ` Wu Fengguang
2011-09-07  1:04       ` Wu Fengguang
2011-09-07  7:31       ` Peter Zijlstra
2011-09-07  7:31         ` Peter Zijlstra
2011-09-07  7:31         ` Peter Zijlstra
2011-09-07 11:00         ` Wu Fengguang
2011-09-07 11:00           ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 06/18] writeback: IO-less balance_dirty_pages() Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-06 12:13   ` Peter Zijlstra
2011-09-06 12:13     ` Peter Zijlstra
2011-09-06 12:13     ` Peter Zijlstra
2011-09-07  2:46     ` Wu Fengguang
2011-09-07  2:46       ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 07/18] writeback: dirty ratelimit - think time compensation Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 08/18] writeback: trace dirty_ratelimit Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 09/18] writeback: trace balance_dirty_pages Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 10/18] writeback: dirty position control - bdi reserve area Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-06 14:09   ` Peter Zijlstra
2011-09-06 14:09     ` Peter Zijlstra
2011-09-06 14:09     ` Peter Zijlstra
2011-09-07 12:31     ` Wu Fengguang
2011-09-07 12:31       ` Wu Fengguang
2011-09-12 10:19       ` Peter Zijlstra
2011-09-12 10:19         ` Peter Zijlstra
2011-09-12 10:19         ` Peter Zijlstra
2011-09-18 14:17         ` Wu Fengguang
2011-09-18 14:37           ` Wu Fengguang
2011-09-18 14:37             ` Wu Fengguang
2011-09-18 14:47             ` Wu Fengguang
2011-09-18 14:47               ` Wu Fengguang
2011-09-28 14:02               ` Wu Fengguang
2011-09-28 14:50                 ` Peter Zijlstra
2011-09-28 14:50                   ` Peter Zijlstra
2011-09-29  3:32                   ` Wu Fengguang
2011-09-29  3:32                     ` Wu Fengguang
2011-09-29  8:49                     ` Peter Zijlstra
2011-09-29  8:49                       ` Peter Zijlstra
2011-09-29  8:49                       ` Peter Zijlstra
2011-09-29 11:05                       ` Wu Fengguang
2011-09-29 11:05                         ` Wu Fengguang
2011-09-29 12:15                 ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 11/18] block: add bdi flag to indicate risk of io queue underrun Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-06 14:22   ` Peter Zijlstra
2011-09-06 14:22     ` Peter Zijlstra
2011-09-07  2:37     ` Wu Fengguang
2011-09-07  2:37       ` Wu Fengguang
2011-09-07  7:31       ` Peter Zijlstra
2011-09-07  7:31         ` Peter Zijlstra
2011-09-07  7:31         ` Peter Zijlstra
2011-09-04  1:53 ` [PATCH 12/18] writeback: balanced_rate cannot exceed write bandwidth Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 13/18] writeback: limit max dirty pause time Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-06 14:52   ` Peter Zijlstra
2011-09-06 14:52     ` Peter Zijlstra
2011-09-06 14:52     ` Peter Zijlstra
2011-09-07  2:35     ` Wu Fengguang
2011-09-07  2:35       ` Wu Fengguang
2011-09-12 10:22       ` Peter Zijlstra
2011-09-12 10:22         ` Peter Zijlstra
2011-09-12 10:22         ` Peter Zijlstra
2011-09-18 14:23         ` Wu Fengguang
2011-09-18 14:23           ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 14/18] writeback: control " Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-06 15:51   ` Peter Zijlstra
2011-09-06 15:51     ` Peter Zijlstra
2011-09-06 15:51     ` Peter Zijlstra
2011-09-07  2:02     ` Wu Fengguang
2011-09-07  2:02       ` Wu Fengguang
2011-09-12 10:28       ` Peter Zijlstra
2011-09-12 10:28         ` Peter Zijlstra
2011-09-12 10:28         ` Peter Zijlstra
2011-09-04  1:53 ` [PATCH 15/18] writeback: charge leaked page dirties to active tasks Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-06 16:16   ` Peter Zijlstra
2011-09-06 16:16     ` Peter Zijlstra
2011-09-06 16:16     ` Peter Zijlstra
2011-09-07  9:06     ` Wu Fengguang
2011-09-07  9:06       ` Wu Fengguang
2011-09-07  0:17   ` Jan Kara
2011-09-07  0:17     ` Jan Kara
2011-09-07  9:37     ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 16/18] writeback: fix dirtied pages accounting on sub-page writes Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53 ` [PATCH 17/18] writeback: fix dirtied pages accounting on redirty Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-06 16:18   ` Peter Zijlstra
2011-09-06 16:18     ` Peter Zijlstra
2011-09-06 16:18     ` Peter Zijlstra
2011-09-07  0:22     ` Jan Kara
2011-09-07  0:22       ` Jan Kara
2011-09-07  1:18       ` Wu Fengguang
2011-09-07  6:56       ` Christoph Hellwig
2011-09-07  6:56         ` Christoph Hellwig
2011-09-07  8:19         ` Peter Zijlstra
2011-09-07  8:19           ` Peter Zijlstra
2011-09-07  8:19           ` Peter Zijlstra
2011-09-07 16:42           ` Jan Kara
2011-09-07 16:42             ` Jan Kara
2011-09-07 16:46             ` Christoph Hellwig
2011-09-07 16:46               ` Christoph Hellwig
2011-09-08  8:51               ` Steven Whitehouse
2011-09-08  8:51                 ` Steven Whitehouse
2011-09-04  1:53 ` [PATCH 18/18] btrfs: fix dirtied pages accounting on sub-page writes Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-04  1:53   ` Wu Fengguang
2011-09-07 13:32 ` [PATCH 00/18] IO-less dirty throttling v11 Wu Fengguang
2011-09-07 13:32   ` Wu Fengguang
2011-09-07 19:14   ` Trond Myklebust
2011-09-07 19:14     ` Trond Myklebust
2011-09-28 14:58 ` Christoph Hellwig
2011-09-28 14:58   ` Christoph Hellwig
2011-09-29  4:11   ` Wu Fengguang
2011-09-29  4:11     ` Wu Fengguang
