[PATCH 0/3] VM throttling: avoid blocking occasional writers

* [PATCH 0/3] VM throttling: avoid blocking occasional writers
@ 2007-03-14 12:42 Tomoki Sekiyama
  2007-03-14 13:18 ` Peter Zijlstra
  2007-03-15 19:07 ` Andrew Morton
  0 siblings, 2 replies; 14+ messages in thread
From: Tomoki Sekiyama @ 2007-03-14 12:42 UTC
  To: akpm, linux-kernel
  Cc: yumiko.sugita.yf, masami.hiramatsu.pt, hidehiro.kawai.ez,
	yuji.kakutani.uw, soshima, haoki, kamezawa.hiroyu, nikita,
	leroy.vanlogchem

Hi,

I ported the patch sent before to 2.6.21-rc3-mm2, so I'm resending it.
( Previous patch is available at
  http://marc.info/?l=linux-kernel&m=117223267512340&w=2 )

-Summary:
I have observed a problem that write(2) can be blocked for a long time
if a system has several disks and is under heavy I/O pressure. This
patchset avoids the problem by introducing a high/low watermark
algorithm into the balance_dirty_pages() function.

-Example of the problem:

There are two processes on a system which has two disks. Process-A
writes heavily to disk-a, and process-B writes small data (e.g. log
files) to disk-b occasionally. A portion of system memory, which
depends on vm.dirty_ratio (typically 40%), is filled up with Dirty
and Writeback pages of disk-a.

In this situation, write(2) of process-B could be blocked for a very
long time (more than 60 seconds), although the load of disk-b is quite
low. In particular, the system would become quite slow, if disk-a is
slow (e.g. backup to a USB disk).

This seems to be the same problem as discussed in LKML:
http://marc.theaimsgroup.com/?t=115559902900003
and
http://marc.theaimsgroup.com/?t=117182340400003

-Cause:

I found that this problem is caused by balance_dirty_pages().

While Dirty+Writeback pages get more than 40% of memory, process-B is
blocked in balance_dirty_pages() until writeback of some (`write_chunk',
typically = 1536) dirty pages on disk-b is started.

However, because disk-b has only a few dirty pages, the process-B will
be blocked until writeback to disk-a is completed and Dirty+Writeback
goes below 40%.

-Solution:

I consider that all of the dirty pages for the disk have been written
back and that the disk is clean if a process cannot write 'write_chunk'
pages in balance_dirty_pages().

To avoid using up the free memory with dirty pages by bypassing blocking,
this patchset adds a new threshold named vm.dirty_limit_ratio to sysctl.

It modifies balance_dirty_pages() not to block when the amount of
Dirty+Writeback is less than vm.dirty_limit_ratio percent of the memory.
In the other cases, writers are throttled as current Linux does.

In this patchset, vm.dirty_limit_ratio, instead of vm.dirty_ratio, is
used as the clamping level of Dirty+Writeback. And, vm.dirty_ratio is
used as the level at which a writer will itself start writeback of the
dirty pages.

-Testing Results:

In the situation explained in "Example of the problem" section, I
measured time of write(2)ing to disk-b.
The write completed in 30 ms or less under the kernel with this
patchset.

When nr_requests is set too high (e.g. 8192), Dirty+Writeback grows near
vm.dirty_limit_ratio (45% of system memory by default). In that case,
write(2) sometimes took about 1 second.

This patchset can be applied to 2.6.21-rc3-mm2.
It consists of 3 pieces:

1/3 - add a sysctl variable `vm.dirty_limit_ratio'
2/3 - modify get_dirty_limits() to return the limit of dirty pages.
3/3 - break out of balance_dirty_pages() loop if the disk doesn't have
      remaining dirty pages, if Dirty+Writeback < vm.dirty_limit_ratio.

-- 
Tomoki Sekiyama
Hitachi, Ltd., Systems Development Laboratory


* Re: [PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-03-14 12:42 [PATCH 0/3] VM throttling: avoid blocking occasional writers Tomoki Sekiyama
@ 2007-03-14 13:18 ` Peter Zijlstra
  2007-03-15 19:07 ` Andrew Morton
  1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2007-03-14 13:18 UTC
  To: Tomoki Sekiyama
  Cc: akpm, linux-kernel, yumiko.sugita.yf, masami.hiramatsu.pt,
	hidehiro.kawai.ez, yuji.kakutani.uw, soshima, haoki,
	kamezawa.hiroyu, nikita, leroy.vanlogchem, Dave Chinner

Hi,

I've been working on an alternative solution (see patch below). However
I haven't posted yet because I'm not quite satisfied and haven't done a
lot of testing.

The patch relies on the per backing dev dirty/writeback counts currently
in -mm to which David Chinner objected. I plan to rework those as percpu
counters.

I think my solution might behave better because it fully decouples the
device throttling.

---

Scale writeback cache per backing device, proportional to its writeout speed.

akpm sayeth:
> Which problem are we trying to solve here?  afaik our two uppermost
> problems are:
> 
> a) Heavy write to queue A causes light writer to queue B to block for a long
> time in balance_dirty_pages().  Even if the devices have the same speed.  

This one; esp when not the same speed. The - my usb stick makes my
computer suck - problem. But even on similar speed, the separation of
device should avoid blocking dev B when dev A is being throttled.

The writeout speed is measured dynamically, so when it doesn't have
anything to write out for a while its writeback cache size goes to 0.

Conversely, when starting up it will in the beginning act almost
synchronous but will quickly build up a 'fair' share of the writeback
cache.

> b) heavy write to device A causes light write to device A to block for a
> long time in balance_dirty_pages(), occasionally.  Harder to fix.

This will indeed take more. I've thought about it though. But one
quickly ends up with per task state.


How it all works:

We pick a 2^n value based on the vm_dirty_ratio and total vm size to act as a
period - vm_cycle_shift. This period measures 'time' in writeout events.

Each writeout increases time and adds to a per bdi counter. This counter is 
halved when a period expires. So per bdi speed is:

  0.5 * (previous cycle speed) + this cycle's events.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/ll_rw_blk.c           |    3 +
 include/linux/backing-dev.h |    7 +++
 include/linux/writeback.h   |   10 ++++
 kernel/sysctl.c             |   10 +++-
 mm/page-writeback.c         |  102 ++++++++++++++++++++++++++++++++++++++------
 5 files changed, 119 insertions(+), 13 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -34,6 +34,13 @@ struct backing_dev_info {
 	void *congested_data;	/* Pointer to aux data for congested func */
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
+
+	/*
+	 * data used for scaling the writeback cache
+	 */
+	spinlock_t lock;		/* protect the cycle count */
+	atomic_long_t nr_writeout;	/* writeout scale */
+	unsigned long cycles;		/* writeout cycles */
 };
 
 
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -4,6 +4,8 @@
 #ifndef WRITEBACK_H
 #define WRITEBACK_H
 
+#include <linux/log2.h>
+
 struct backing_dev_info;
 
 extern spinlock_t inode_lock;
@@ -89,11 +91,19 @@ void throttle_vm_writeout(gfp_t gfp_mask
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern int vm_dirty_ratio;
+extern int vm_cycle_shift;
 extern int dirty_writeback_interval;
 extern int dirty_expire_interval;
 extern int block_dump;
 extern int laptop_mode;
 
+extern long vm_total_pages; /* reduce dependency stuff */
+static inline void update_cycle_shift(void)
+{
+	unsigned long dirty_pages = (vm_dirty_ratio * vm_total_pages) / 100;
+	vm_cycle_shift = 2 + ilog2_up(int_sqrt(dirty_pages));
+}
+
 struct ctl_table;
 struct file;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int, struct file *,
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -612,6 +612,14 @@ static ctl_table kern_table[] = {
 static int zero;
 static int one_hundred = 100;
 
+static int proc_dointvec_vm_dirty_ratio(ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
+	update_cycle_shift();
+	return ret;
+}
 
 static ctl_table vm_table[] = {
 	{
@@ -663,7 +671,7 @@ static ctl_table vm_table[] = {
 		.data		= &vm_dirty_ratio,
 		.maxlen		= sizeof(vm_dirty_ratio),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec_minmax,
+		.proc_handler	= &proc_dointvec_vm_dirty_ratio,
 		.strategy	= &sysctl_intvec,
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -73,6 +73,9 @@ int dirty_background_ratio = 10;
  * The generator of dirty data starts writeback at this percentage
  */
 int vm_dirty_ratio = 40;
+int vm_cycle_shift;
+
+static DEFINE_PER_CPU(unsigned long, vm_writeout) = {0};
 
 /*
  * The interval between `kupdate'-style writebacks, in jiffies
@@ -102,6 +105,55 @@ EXPORT_SYMBOL(laptop_mode);
 
 static void background_writeout(unsigned long _min_pages);
 
+static unsigned long bdi_total_writeout(void)
+{
+	int cpu;
+	unsigned long sum = 0;
+	for_each_possible_cpu(cpu)
+		sum += per_cpu(vm_writeout, cpu);
+	return sum;
+}
+
+static void bdi_writeout_norm(struct backing_dev_info *bdi)
+{
+	int bits = vm_cycle_shift;
+	unsigned long cycle = 1UL << bits;
+	unsigned long mask = ~(cycle - 1);
+	unsigned long total = bdi_total_writeout() << 1;
+
+	if ((bdi->cycles & mask) == (total & mask))
+		return;
+
+	spin_lock(&bdi->lock);
+	while ((bdi->cycles & mask) != (total & mask)) {
+		atomic_long_sub(atomic_long_read(&bdi->nr_writeout) / 2,
+				&bdi->nr_writeout);
+		bdi->cycles += cycle;
+	}
+	spin_unlock(&bdi->lock);
+}
+
+static void bdi_writeout_inc(struct backing_dev_info *bdi)
+{
+	get_cpu_var(vm_writeout)++;
+	put_cpu();
+
+	if (!(atomic_long_inc_return(&bdi->nr_writeout) & 0x7))
+		bdi_writeout_norm(bdi);
+}
+
+static void
+get_writeout_scale(struct address_space *mapping, int *scale, int *div)
+{
+	int bits = vm_cycle_shift - 1;
+	unsigned long total = bdi_total_writeout();
+	unsigned long cycle = 1UL << bits;
+	unsigned long mask = cycle - 1;
+
+	*scale = atomic_long_read(&mapping->backing_dev_info->nr_writeout);
+	*div = cycle + (total & mask);
+}
+
 /*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
@@ -120,7 +172,7 @@ static void background_writeout(unsigned
  * clamping level.
  */
 static void
-get_dirty_limits(long *pbackground, long *pdirty,
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
 					struct address_space *mapping)
 {
 	int background_ratio;		/* Percentages */
@@ -163,6 +215,21 @@ get_dirty_limits(long *pbackground, long
 	}
 	*pbackground = background;
 	*pdirty = dirty;
+
+	if (mapping) {
+		long long tmp = dirty;
+		int scale, div;
+
+		get_writeout_scale(mapping, &scale, &div);
+
+		if (scale > div)
+			scale = div;
+
+		tmp = (tmp * 122) >> 7; /* take ~95% of total dirty value */
+		tmp *= scale;
+		do_div(tmp, div);
+		*pbdi_dirty = (long)tmp;
+	}
 }
 
 /*
@@ -177,6 +244,7 @@ static void balance_dirty_pages(struct a
 	long nr_reclaimable;
 	long background_thresh;
 	long dirty_thresh;
+	long bdi_thresh;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
 
@@ -191,11 +259,15 @@ static void balance_dirty_pages(struct a
 			.range_cyclic	= 1,
 		};
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
+		get_dirty_limits(&background_thresh, &dirty_thresh,
+				&bdi_thresh, mapping);
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
-			dirty_thresh)
+		if ((nr_reclaimable + global_page_state(NR_WRITEBACK) <=
+			dirty_thresh) &&
+		    (atomic_long_read(&bdi->nr_dirty) +
+		     atomic_long_read(&bdi->nr_writeback) <=
+		     	bdi_thresh))
 				break;
 
 		if (!dirty_exceeded)
@@ -209,14 +281,18 @@ static void balance_dirty_pages(struct a
 		 */
 		if (nr_reclaimable) {
 			writeback_inodes(&wbc);
-			get_dirty_limits(&background_thresh,
-					 	&dirty_thresh, mapping);
+
+			get_dirty_limits(&background_thresh, &dirty_thresh,
+				       &bdi_thresh, mapping);
 			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-			if (nr_reclaimable +
-				global_page_state(NR_WRITEBACK)
-					<= dirty_thresh)
-						break;
+			if ((nr_reclaimable + global_page_state(NR_WRITEBACK) <=
+				dirty_thresh) &&
+			    (atomic_long_read(&bdi->nr_dirty) +
+			     atomic_long_read(&bdi->nr_writeback) <=
+				 bdi_thresh))
+				break;
+
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
@@ -312,7 +388,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	}
 
         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
@@ -347,7 +423,7 @@ static void background_writeout(unsigned
 		long background_thresh;
 		long dirty_thresh;
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 		if (global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) < background_thresh
 				&& min_pages <= 0)
@@ -555,6 +631,7 @@ void __init page_writeback_init(void)
 	mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
 	writeback_set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
+	update_cycle_shift();
 }
 
 /**
@@ -935,6 +1012,7 @@ int test_clear_page_writeback(struct pag
 						PAGECACHE_TAG_WRITEBACK);
 			atomic_long_dec(&mapping->backing_dev_info->
 					nr_writeback);
+			bdi_writeout_inc(mapping->backing_dev_info);
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -215,6 +215,9 @@ void blk_queue_make_request(request_queu
 	bdi->capabilities = BDI_CAP_MAP_COPY;
 	atomic_long_set(&bdi->nr_dirty, 0);
 	atomic_long_set(&bdi->nr_writeback, 0);
+	spin_lock_init(&bdi->lock);
+	atomic_long_set(&bdi->nr_writeout, 0);
+	bdi->cycles = 0;
 	blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
 	blk_queue_hardsect_size(q, 512);
 	blk_queue_dma_alignment(q, 511);





* Re: [PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-03-14 12:42 [PATCH 0/3] VM throttling: avoid blocking occasional writers Tomoki Sekiyama
  2007-03-14 13:18 ` Peter Zijlstra
@ 2007-03-15 19:07 ` Andrew Morton
  2007-03-18 14:59   ` Bill Davidsen
  1 sibling, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2007-03-15 19:07 UTC
  To: Tomoki Sekiyama
  Cc: linux-kernel, yumiko.sugita.yf, masami.hiramatsu.pt,
	hidehiro.kawai.ez, yuji.kakutani.uw, soshima, haoki,
	kamezawa.hiroyu, nikita, leroy.vanlogchem

> On Wed, 14 Mar 2007 21:42:46 +0900 Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com> wrote:
>
> ...
>
> 
> -Solution:
> 
> I consider that all of the dirty pages for the disk have been written
> back and that the disk is clean if a process cannot write 'write_chunk'
> pages in balance_dirty_pages().
> 
> To avoid using up the free memory with dirty pages by bypassing blocking,
> this patchset adds a new threshold named vm.dirty_limit_ratio to sysctl.
> 
> It modifies balance_dirty_pages() not to block when the amount of
> Dirty+Writeback is less than vm.dirty_limit_ratio percent of the memory.
> In the other cases, writers are throttled as current Linux does.
> 
> 
> In this patchset, vm.dirty_limit_ratio, instead of vm.dirty_ratio, is
> used as the clamping level of Dirty+Writeback. And, vm.dirty_ratio is
> used as the level at which a writer will itself start writeback of the
> dirty pages.

Might be a reasonable solution - let's see what Peter comes up with too.

Comments on the patch:

- Please don't add VM_DIRTY_LIMIT_RATIO: just use CTL_UNNUMBERED and leave
  sysctl.h alone.

- The 40% default is already too high.  Let's set this new upper limit to
  40% and decrease the non-blocking ratio.

- Please update the procfs documentation in ./Documentation/

- I wonder if dirty_limit_ratio is the best name we could choose. 
  vm_dirty_blocking_ratio, perhaps?  Dunno.



* Re: [PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-03-15 19:07 ` Andrew Morton
@ 2007-03-18 14:59   ` Bill Davidsen
  2007-03-22  5:49     ` Tomoki Sekiyama
  0 siblings, 1 reply; 14+ messages in thread
From: Bill Davidsen @ 2007-03-18 14:59 UTC
  To: Andrew Morton
  Cc: Tomoki Sekiyama, linux-kernel, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, yuji.kakutani.uw,
	soshima, haoki, kamezawa.hiroyu, nikita, leroy.vanlogchem

Andrew Morton wrote:
>> On Wed, 14 Mar 2007 21:42:46 +0900 Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com> wrote:
>>
>> ...
>>
>>
>> -Solution:
>>
>> I consider that all of the dirty pages for the disk have been written
>> back and that the disk is clean if a process cannot write 'write_chunk'
>> pages in balance_dirty_pages().
>>
>> To avoid using up the free memory with dirty pages by bypassing blocking,
>> this patchset adds a new threshold named vm.dirty_limit_ratio to sysctl.
>>
>> It modifies balance_dirty_pages() not to block when the amount of
>> Dirty+Writeback is less than vm.dirty_limit_ratio percent of the memory.
>> In the other cases, writers are throttled as current Linux does.
>>
>>
>> In this patchset, vm.dirty_limit_ratio, instead of vm.dirty_ratio, is
>> used as the clamping level of Dirty+Writeback. And, vm.dirty_ratio is
>> used as the level at which a writer will itself start writeback of the
>> dirty pages.
> 
> Might be a reasonable solution - let's see what Peter comes up with too.
> 
> Comments on the patch:
> 
> - Please don't add VM_DIRTY_LIMIT_RATIO: just use CTL_UNNUMBERED and leave
>   sysctl.h alone.
> 
> - The 40% default is already too high.  Let's set this new upper limit to
>   40% and decrease the non-blocking ratio.
> 
> - Please update the procfs documentation in ./Documentation/
> 
> - I wonder if dirty_limit_ratio is the best name we could choose. 
>   vm_dirty_blocking_ratio, perhaps?  Dunno.
> 
I don't like it, but I dislike it less than "dirty_limit_ratio" I guess. 
It would probably break things to change it now, including my 
sysctl.conf on a number of systems :-(

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


* Re: [PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-03-18 14:59   ` Bill Davidsen
@ 2007-03-22  5:49     ` Tomoki Sekiyama
  2007-03-22 11:41       ` Bill Davidsen
  0 siblings, 1 reply; 14+ messages in thread
From: Tomoki Sekiyama @ 2007-03-22  5:49 UTC
  To: Bill Davidsen, Andrew Morton
  Cc: linux-kernel, yumiko.sugita.yf, masami.hiramatsu.pt,
	hidehiro.kawai.ez, yuji.kakutani.uw, soshima, haoki,
	kamezawa.hiroyu, nikita, leroy.vanlogchem

Hi,

Thanks for your comments.
I'm sorry for my late reply.

Bill Davidsen wrote:
 > Andrew Morton wrote:
 >>> On Wed, 14 Mar 2007 21:42:46 +0900 Tomoki Sekiyama
 >>> <tomoki.sekiyama.qu@hitachi.com> wrote:
 >>>
 >>> ...
 >>>
 >>>
 >>> -Solution:
 >>>
 >>> I consider that all of the dirty pages for the disk have been written
 >>> back and that the disk is clean if a process cannot write 'write_chunk'
 >>> pages in balance_dirty_pages().
 >>>
 >>> To avoid using up the free memory with dirty pages by bypassing blocking,
 >>> this patchset adds a new threshold named vm.dirty_limit_ratio to sysctl.
 >>>
 >>> It modifies balance_dirty_pages() not to block when the amount of
 >>> Dirty+Writeback is less than vm.dirty_limit_ratio percent of the memory.
 >>> In the other cases, writers are throttled as current Linux does.
 >>>
 >>>
 >>> In this patchset, vm.dirty_limit_ratio, instead of vm.dirty_ratio, is
 >>> used as the clamping level of Dirty+Writeback. And, vm.dirty_ratio is
 >>> used as the level at which a writer will itself start writeback of the
 >>> dirty pages.
 >>
 >> Might be a reasonable solution - let's see what Peter comes up with too.
 >>
 >> Comments on the patch:
 >>
 >> - Please don't add VM_DIRTY_LIMIT_RATIO: just use CTL_UNNUMBERED and leave
 >>   sysctl.h alone.
 >>
 >> - The 40% default is already too high.  Let's set this new upper limit to
 >>   40% and decrease the non-blocking ratio.
 >>
 >> - Please update the procfs documentation in ./Documentation/

OK, I'm going to fix them and repost the patchset.


 >> - I wonder if dirty_limit_ratio is the best name we could choose.
 >> vm_dirty_blocking_ratio, perhaps?  Dunno.
 >>
 > I don't like it, but I dislike it less than "dirty_limit_ratio" I guess.
 > It would probably break things to change it now, including my
 > sysctl.conf on a number of systems  :-(

I'm wondering which interface is preferred...

1) Just rename "dirty_limit_ratio" to "dirty_blocking_ratio."
    Those who had been changing dirty_ratio should additionally modify
    dirty_blocking_ratio in order to determine the upper limit of dirty pages.

2) Change "dirty_ratio" to a vector consisting of 2 values:
    {blocking ratio, writeback starting ratio}.
    For example, to change both values:
      # echo 40 35 > /proc/sys/vm/dirty_ratio
    And to change only the first one:
      # echo 20 > /proc/sys/vm/dirty_ratio
    In the latter case, the writeback starting ratio is regarded as the same as the
    blocking ratio if the blocking ratio is smaller. And then, the kernel behaves
    similarly to the current kernel.

3) Use "dirty_ratio" as the blocking ratio. And add
    "start_writeback_ratio", and start writeback at
    start_writeback_ratio(default:90) * dirty_ratio / 100 [%].
    In this way, specifying blocking ratio can be done in the same way as
    current kernel, but high/low watermark algorithm is enabled.


Regards,
-- 
Tomoki Sekiyama
Hitachi, Ltd., Systems Development Laboratory



* Re: [PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-03-22  5:49     ` Tomoki Sekiyama
@ 2007-03-22 11:41       ` Bill Davidsen
  2007-03-26 10:27         ` Tomoki Sekiyama
  0 siblings, 1 reply; 14+ messages in thread
From: Bill Davidsen @ 2007-03-22 11:41 UTC
  To: Tomoki Sekiyama
  Cc: Andrew Morton, linux-kernel, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, yuji.kakutani.uw,
	soshima, haoki, kamezawa.hiroyu, nikita, leroy.vanlogchem

Tomoki Sekiyama wrote:
> Hi,
>
> Thanks for your comments.
> I'm sorry for my late reply.
>
> Bill Davidsen wrote:
> > Andrew Morton wrote:
>
> >> - I wonder if dirty_limit_ratio is the best name we could choose.
> >> vm_dirty_blocking_ratio, perhaps?  Dunno.
> >>
> > I don't like it, but I dislike it less than "dirty_limit_ratio" I 
> guess.
> > It would probably break things to change it now, including my
> > sysctl.conf on a number of systems  :-(
>
> I'm wondering which interface is preferred...
>
> 1) Just rename "dirty_limit_ratio" to "dirty_blocking_ratio."
>    Those who had been changing dirty_ratio should additionally modify
>    dirty_blocking_ratio in order to determine the upper limit of dirty 
> pages.
>
> 2) Change "dirty_ratio" to a vector consisting of 2 values:
>    {blocking ratio, writeback starting ratio}.
>    For example, to change both values:
>      # echo 40 35 > /proc/sys/vm/dirty_ratio
>    And to change only the first one:
>      # echo 20 > /proc/sys/vm/dirty_ratio
>    In the latter case, the writeback starting ratio is regarded as the
>    same as the blocking ratio if the blocking ratio is smaller. And
>    then, the kernel behaves similarly to the current kernel.
>
> 3) Use "dirty_ratio" as the blocking ratio. And add
>    "start_writeback_ratio", and start writeback at
>    start_writeback_ratio(default:90) * dirty_ratio / 100 [%].
>    In this way, specifying blocking ratio can be done in the same way as
>    current kernel, but high/low watermark algorithm is enabled.
I like 3 better, it should make tuning behavior more precise. You can 
make an argument for absolute values for writeback, if my disk will only 
write 70MB/s I may only want 203 sec of pending writes, regardless of 
available memory.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: [PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-03-22 11:41       ` Bill Davidsen
@ 2007-03-26 10:27         ` Tomoki Sekiyama
  2007-03-26 17:11           ` Bill Davidsen
  0 siblings, 1 reply; 14+ messages in thread
From: Tomoki Sekiyama @ 2007-03-26 10:27 UTC
  To: Bill Davidsen, Andrew Morton
  Cc: linux-kernel, yumiko.sugita.yf, masami.hiramatsu.pt,
	hidehiro.kawai.ez, yuji.kakutani.uw, soshima, haoki,
	kamezawa.hiroyu, nikita, leroy.vanlogchem

Hi,
Thanks for your reply.

>>3) Use "dirty_ratio" as the blocking ratio. And add
>>   "start_writeback_ratio", and start writeback at
>>   start_writeback_ratio(default:90) * dirty_ratio / 100 [%].
>>   In this way, specifying blocking ratio can be done in the same way
>>   as current kernel, but high/low watermark algorithm is enabled.
>I like 3 better, it should make tuning behavior more precise.

Then, what do you think of the following idea?

(4) add `dirty_start_writeback_ratio' as percentage of memory,
    at which a generator of dirty pages itself starts writeback
    (that is, non-blocking ratio).

In this way, `dirty_ratio' is used as the blocking ratio, so we don't
need to modify the sysctl.conf etc. I think it's easier to understand
for administrators of systems, because the interface is similar to
`dirty_background_ratio' and `dirty_ratio.'

If this is OK, I'll repost the patch.

> You can make an argument for absolute values for writeback,
> if my disk will only write 70MB/s I may only want 203 sec of
> pending writes, regardless of available memory.

To realize tuning with absolute values, I consider that we need to
modify handling of `dirty_background_ratio,' `dirty_ratio' and so on as
well as `dirty_start_writeback_ratio.' I think this should be done in
another patch if this feature is required.

Regards,
--
Tomoki Sekiyama
Hitachi, Ltd., Systems Development Laboratory


* Re: [PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-03-26 10:27         ` Tomoki Sekiyama
@ 2007-03-26 17:11           ` Bill Davidsen
  2007-04-03 10:42             ` Tomoki Sekiyama
                               ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Bill Davidsen @ 2007-03-26 17:11 UTC
  To: Tomoki Sekiyama
  Cc: Andrew Morton, linux-kernel, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, yuji.kakutani.uw,
	soshima, haoki, kamezawa.hiroyu, nikita, leroy.vanlogchem

Tomoki Sekiyama wrote:
> Hi,
> Thanks for your reply.
>
>   
>>> 3) Use "dirty_ratio" as the blocking ratio. And add
>>>   "start_writeback_ratio", and start writeback at
>>>   start_writeback_ratio(default:90) * dirty_ratio / 100 [%].
>>>   In this way, specifying blocking ratio can be done in the same way
>>>   as current kernel, but high/low watermark algorithm is enabled.
>>>       
>> I like 3 better, it should make tuning behavior more precise.
>>     
>
> Then, what do you think of the following idea?
>
> (4) add `dirty_start_writeback_ratio' as percentage of memory,
>     at which a generator of dirty pages itself starts writeback
>     (that is, non-blocking ratio).
>
> In this way, `dirty_ratio' is used as the blocking ratio, so we don't
> need to modify the sysctl.conf etc. I think it's easier to understand
> for administrators of systems, because the interface is similar to
> `dirty_background_ratio' and `dirty_ratio.'
>
> If this is OK, I'll repost the patch.
>   
It sounds good to me, just be sure behavior is sane for both 
blocking less than start_writeback and vice versa.
>   
>> You can make an argument for absolute values for writeback,
>> if my disk will only write 70MB/s I may only want 203 sec of
>> pending writes, regardless of available memory.
>>     
>
> To realize tuning with absolute values, I consider that we need to
> modify handling of `dirty_background_ratio,' `dirty_ratio' and so on as
> well as `dirty_start_writeback_ratio.' I think this should be done in
> another patch if this feature is required.
>
> Regards,
> --
> Tomoki Sekiyama
> Hitachi, Ltd., Systems Development Laboratory
>
>
>   


-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: [PATCH 0/3] VM throttling: avoid blocking occasional writers
  2007-03-26 17:11           ` Bill Davidsen
@ 2007-04-03 10:42             ` Tomoki Sekiyama
  2007-04-03 10:46             ` [PATCH 1/2] VM throttling: Start writeback at dirty_writeback_start_ratio Tomoki Sekiyama
  2007-04-03 10:47             ` [PATCH 2/2] VM throttling: Add vm.dirty_start_writeback_ratio to sysctl Tomoki Sekiyama
  2 siblings, 0 replies; 14+ messages in thread
From: Tomoki Sekiyama @ 2007-04-03 10:42 UTC
  To: Bill Davidsen
  Cc: Andrew Morton, linux-kernel, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, yuji.kakutani.uw,
	soshima, haoki

Hi,
Thanks for your comments.
I'm sorry for my late reply.

Bill Davidsen wrote:
>> Then, what do you think of the following idea?
>>
>> (4) add `dirty_start_writeback_ratio' as percentage of memory,
>>     at which a generator of dirty pages itself starts writeback
>>     (that is, non-blocking ratio).
>>
>> In this way, `dirty_ratio' is used as the blocking ratio, so we don't
>> need to modify the sysctl.conf etc. I think it's easier to understand
>> for administrators of systems, because the interface is similar to
>> `dirty_background_ratio' and `dirty_ratio.'
>>
> It sounds good to me, just be sure behavior is sane for both
> blocking less than start_writeback and vice versa.

Then I'm going to post the new patchset.
In my new patchset, if dirty_ratio < dirty_start_writeback_ratio,
dirty_start_writeback_ratio is just regarded as the same value as
dirty_ratio, and then the kernel behaves similarly to the current kernel.

Regards,
-- 
Tomoki Sekiyama
Hitachi, Ltd., Systems Development Laboratory



* [PATCH 1/2] VM throttling: Start writeback at dirty_writeback_start_ratio
  2007-03-26 17:11           ` Bill Davidsen
  2007-04-03 10:42             ` Tomoki Sekiyama
@ 2007-04-03 10:46             ` Tomoki Sekiyama
  2007-04-06  0:31               ` Andrew Morton
  2007-04-03 10:47             ` [PATCH 2/2] VM throttling: Add vm.dirty_start_writeback_ratio to sysctl Tomoki Sekiyama
  2 siblings, 1 reply; 14+ messages in thread
From: Tomoki Sekiyama @ 2007-04-03 10:46 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Bill Davidsen, yumiko.sugita.yf, masami.hiramatsu.pt,
	hidehiro.kawai.ez, yuji.kakutani.uw, soshima, haoki

This patchset is to avoid the problem that write(2) can be blocked for a
long time if a system has several disks with different speeds and is
under heavy I/O pressure.

-Description of the problem:
While Dirty+Writeback pages exceed 40% (`dirty_ratio') of memory,
generators of dirty pages are blocked in balance_dirty_pages() until
they start writeback of a specific number (`write_chunk', typically=1536)
of dirty pages on the disks they write to.

Under this rule, if a process writes to the disk which has only a few
(less than 1536) dirty pages, that process will be blocked until
writeback of the other disks is completed and % of Dirty+Writeback goes
below 40%.

Thus, if a slow device (such as a USB disk) has many dirty pages, the
processes which write small data to the other disks can be blocked for
quite a long time.

-Solution:
This patch introduces a high/low-watermark algorithm in
balance_dirty_pages() in order to throttle only the processes which
write to disks with heavy load.

This patch adds `dirty_start_writeback_ratio' for the low-watermark,
and modifies get_dirty_limits() to calculate and return the writeback
starting level of dirty pages based on `dirty_start_writeback_ratio'.

If % of Dirty+Writeback > `dirty_start_writeback_ratio', generators of
dirty pages start writeback of dirty pages by themselves. At that time,
these processes are not blocked in balance_dirty_pages(), but they may
be blocked if the write-requests-queue of the written disk is full
(that is, the length of the queue > `nr_requests'). By this behavior,
we can throttle only processes which write to the disks with heavy load,
and can allow processes to write to the other disks without blocking.

If % of Dirty+Writeback > `dirty_ratio', generators of dirty pages
are throttled as in current Linux, so that memory does not fill up with
dirty pages.
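
As a rough illustration (not part of the patch), the three levels described
above can be sketched in userspace C. The enum and function names are
hypothetical, and treating a value exactly equal to a threshold as "not over"
is an assumption based on the `<=' comparison used in balance_dirty_pages().

```c
/* What a generator of dirty pages does at a given dirty level. */
enum writer_action {
	WRITER_PROCEED,		/* at or below the low watermark: return at once */
	WRITER_START_WRITEBACK,	/* between the watermarks: start writeback itself */
	WRITER_THROTTLE,	/* above dirty_thresh: blocked as in current Linux */
};

/*
 * All arguments are page counts. dirty_and_writeback stands for
 * Dirty + Writeback (+ unstable NFS) pages; the two thresholds are the
 * values get_dirty_limits() would return.
 */
enum writer_action classify_writer(long dirty_and_writeback,
				   long start_writeback_thresh,
				   long dirty_thresh)
{
	if (dirty_and_writeback <= start_writeback_thresh)
		return WRITER_PROCEED;
	if (dirty_and_writeback <= dirty_thresh)
		return WRITER_START_WRITEBACK;
	return WRITER_THROTTLE;
}
```

In the middle state the writer is not blocked in balance_dirty_pages()
itself; it can still block when the request queue of the disk it writes to
is full, which is what limits only the writers to heavily loaded disks.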

Thanks,

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
---
 include/linux/writeback.h |    1
 mm/page-writeback.c       |   52 ++++++++++++++++++++++++++++++++++++----------
 2 files changed, 42 insertions(+), 11 deletions(-)

Index: linux-2.6.21-rc5-mm3-writeback/include/linux/writeback.h
===================================================================
--- linux-2.6.21-rc5-mm3-writeback.orig/include/linux/writeback.h
+++ linux-2.6.21-rc5-mm3-writeback/include/linux/writeback.h
@@ -94,6 +94,7 @@ static inline int laptop_spinned_down(vo

 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
+extern int dirty_start_writeback_ratio;
 extern int vm_dirty_ratio;
 extern int dirty_writeback_interval;
 extern int dirty_expire_interval;
Index: linux-2.6.21-rc5-mm3-writeback/mm/page-writeback.c
===================================================================
--- linux-2.6.21-rc5-mm3-writeback.orig/mm/page-writeback.c
+++ linux-2.6.21-rc5-mm3-writeback/mm/page-writeback.c
@@ -72,6 +72,11 @@ int dirty_background_ratio = 10;
 /*
  * The generator of dirty data starts writeback at this percentage
  */
+int dirty_start_writeback_ratio = 35;
+
+/*
+ * The generator of dirty data is blocked at this percentage
+ */
 int vm_dirty_ratio = 40;

 /*
@@ -112,12 +117,16 @@ static void background_writeout(unsigned
  * performing lots of scanning.
  *
  * We only allow 1/2 of the currently-unmapped memory to be dirtied.
+ * `vm.dirty_ratio' is ignored if it is larger than that.
+ * In this case, `vm.dirty_start_writeback_ratio' is also decreased to keep
+ * writeback independently among disks.
  *
  * We don't permit the clamping level to fall below 5% - that is getting rather
  * excessive.
  *
- * We make sure that the background writeout level is below the adjusted
- * clamping level.
+ * We make sure that the active writeout level is below the adjusted clamping
+ * level, and that the background writeout level is below the active writeout
+ * level.
  */

 static unsigned long highmem_dirtyable_memory(unsigned long total)
@@ -158,13 +167,15 @@ static unsigned long determine_dirtyable
 }

 static void
-get_dirty_limits(long *pbackground, long *pdirty,
+get_dirty_limits(long *pbackground, long *pstart_writeback, long *pdirty,
 					struct address_space *mapping)
 {
 	int background_ratio;		/* Percentages */
+	int start_writeback_ratio;
 	int dirty_ratio;
 	int unmapped_ratio;
 	long background;
+	long start_writeback;
 	long dirty;
 	unsigned long available_memory = determine_dirtyable_memory();
 	struct task_struct *tsk;
@@ -177,28 +188,40 @@ get_dirty_limits(long *pbackground, long
 	if (dirty_ratio > unmapped_ratio / 2)
 		dirty_ratio = unmapped_ratio / 2;

+	start_writeback_ratio = dirty_start_writeback_ratio;
+	if (start_writeback_ratio > dirty_ratio)
+		start_writeback_ratio = dirty_ratio;
+	start_writeback_ratio -= vm_dirty_ratio - dirty_ratio;
+
 	if (dirty_ratio < 5)
 		dirty_ratio = 5;
+	if (start_writeback_ratio < 2)
+		start_writeback_ratio = 2;

 	background_ratio = dirty_background_ratio;
-	if (background_ratio >= dirty_ratio)
-		background_ratio = dirty_ratio / 2;
+	if (background_ratio >= start_writeback_ratio)
+		background_ratio = start_writeback_ratio / 2;

 	background = (background_ratio * available_memory) / 100;
+	start_writeback = (start_writeback_ratio * available_memory) / 100;
 	dirty = (dirty_ratio * available_memory) / 100;
 	tsk = current;
 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
 		background += background / 4;
+		start_writeback += start_writeback / 4;
 		dirty += dirty / 4;
 	}
 	*pbackground = background;
+	*pstart_writeback = start_writeback;
 	*pdirty = dirty;
 }

 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * the caller to perform writeback if the system is over
+ * `start_writeback_thresh'. If the system is over `dirty_thresh' then the
+ * caller will be blocked if it cannot write back enough pages.
  * If we're over `background_thresh' then pdflush is woken to perform some
  * writeout.
  */
@@ -206,6 +229,7 @@ static void balance_dirty_pages(struct a
 {
 	long nr_reclaimable;
 	long background_thresh;
+	long start_writeback_thresh;
 	long dirty_thresh;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
@@ -221,11 +245,12 @@ static void balance_dirty_pages(struct a
 			.range_cyclic	= 1,
 		};

-		get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
+		get_dirty_limits(&background_thresh, &start_writeback_thresh,
+				 &dirty_thresh, mapping);
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
-			dirty_thresh)
+			start_writeback_thresh)
 				break;

 		if (!dirty_exceeded)
@@ -240,7 +265,8 @@ static void balance_dirty_pages(struct a
 		if (nr_reclaimable) {
 			writeback_inodes(&wbc);
 			get_dirty_limits(&background_thresh,
-					 	&dirty_thresh, mapping);
+					 &start_writeback_thresh,
+					 &dirty_thresh, mapping);
 			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 			if (nr_reclaimable +
@@ -329,6 +355,7 @@ EXPORT_SYMBOL(balance_dirty_pages_rateli
 void throttle_vm_writeout(gfp_t gfp_mask)
 {
 	long background_thresh;
+	long writeback_thresh;
 	long dirty_thresh;

 	if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
@@ -342,7 +369,8 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	}

         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &writeback_thresh,
+				 &dirty_thresh, NULL);

                 /*
                  * Boost the allowable dirty threshold a bit for page
@@ -375,9 +403,11 @@ static void background_writeout(unsigned

 	for ( ; ; ) {
 		long background_thresh;
+		long writeback_thresh;
 		long dirty_thresh;

-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &writeback_thresh,
+				 &dirty_thresh,  NULL);
 		if (global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) < background_thresh
 				&& min_pages <= 0)


* [PATCH 2/2] VM throttling: Add vm.dirty_start_writeback_ratio to sysctl
  2007-03-26 17:11           ` Bill Davidsen
  2007-04-03 10:42             ` Tomoki Sekiyama
  2007-04-03 10:46             ` [PATCH 1/2] VM throttling: Start writeback at dirty_writeback_start_ratio Tomoki Sekiyama
@ 2007-04-03 10:47             ` Tomoki Sekiyama
  2 siblings, 0 replies; 14+ messages in thread
From: Tomoki Sekiyama @ 2007-04-03 10:47 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Bill Davidsen, yumiko.sugita.yf, masami.hiramatsu.pt,
	hidehiro.kawai.ez, yuji.kakutani.uw, soshima, haoki

This patch adds a sysctl variable `vm.dirty_start_writeback_ratio' to
enable users to adjust the writeback starting level of the dirty pages.

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
---
 Documentation/filesystems/proc.txt |   11 +++++++++--
 Documentation/sysctl/vm.txt        |    3 ++-
 kernel/sysctl.c                    |   11 +++++++++++
 3 files changed, 22 insertions(+), 3 deletions(-)

Index: linux-2.6.21-rc5-mm3-writeback/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.21-rc5-mm3-writeback.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.21-rc5-mm3-writeback/Documentation/filesystems/proc.txt
@@ -1170,13 +1170,20 @@ dirty_background_ratio
 Contains, as a percentage of total system memory, the number of pages at which
 the pdflush background writeback daemon will start writing out dirty data.

-dirty_ratio
------------------
+dirty_start_writeback_ratio
+---------------------------

 Contains, as a percentage of total system memory, the number of pages at which
 a process which is generating disk writes will itself start writing out dirty
 data.

+dirty_ratio
+-----------------
+
+Contains, as a percentage of total system memory, the number of pages at which
+a process which is generating disk writes will be blocked until the level
+subsides.
+
 dirty_writeback_centisecs
 -------------------------

Index: linux-2.6.21-rc5-mm3-writeback/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.21-rc5-mm3-writeback.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.21-rc5-mm3-writeback/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ Currently, these files are in /proc/sys/
 - overcommit_memory
 - page-cluster
 - dirty_ratio
+- dirty_start_writeback_ratio
 - dirty_background_ratio
 - dirty_expire_centisecs
 - dirty_writeback_centisecs
@@ -40,7 +41,7 @@ Currently, these files are in /proc/sys/
 dirty_ratio, dirty_background_ratio, dirty_expire_centisecs,
 dirty_writeback_centisecs, vfs_cache_pressure, laptop_mode,
 block_dump, swap_token_timeout, drop-caches,
-hugepages_treat_as_movable:
+hugepages_treat_as_movable, dirty_start_writeback_ratio:

 See Documentation/filesystems/proc.txt

Index: linux-2.6.21-rc5-mm3-writeback/kernel/sysctl.c
===================================================================
--- linux-2.6.21-rc5-mm3-writeback.orig/kernel/sysctl.c
+++ linux-2.6.21-rc5-mm3-writeback/kernel/sysctl.c
@@ -708,6 +708,17 @@ static ctl_table vm_table[] = {
 		.extra2		= &one_hundred,
 	},
 	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "dirty_start_writeback_ratio",
+		.data		= &dirty_start_writeback_ratio,
+		.maxlen		= sizeof(dirty_start_writeback_ratio),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
+	{
 		.ctl_name	= VM_DIRTY_WB_CS,
 		.procname	= "dirty_writeback_centisecs",
 		.data		= &dirty_writeback_interval,


* Re: [PATCH 1/2] VM throttling: Start writeback at dirty_writeback_start_ratio
  2007-04-03 10:46             ` [PATCH 1/2] VM throttling: Start writeback at dirty_writeback_start_ratio Tomoki Sekiyama
@ 2007-04-06  0:31               ` Andrew Morton
  2007-04-10  3:04                 ` Tomoki Sekiyama
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2007-04-06  0:31 UTC (permalink / raw)
  To: Tomoki Sekiyama
  Cc: linux-kernel, Bill Davidsen, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, yuji.kakutani.uw,
	soshima, haoki, Peter Zijlstra

On Tue, 03 Apr 2007 19:46:04 +0900
Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com> wrote:

> This patchset is to avoid the problem that write(2) can be blocked for a
> long time if a system has several disks with different speed and is
> under heavy I/O pressure.
> 
> -Description of the problem:
> While Dirty+Writeback pages get more than 40%(`dirty_ratio') of memory,
> generators of dirty pages are blocked in balance_dirty_pages() until
> they start writeback of a specific number (`write_chunk', typically=1536)
> of dirty pages on the disks they write to.
> 
> Under this rule, if a process writes to the disk which has only a few
> (less than 1536) dirty pages, that process will be blocked until
> writeback of the other disks is completed and % of Dirty+Writeback goes
> below 40%.
> 
> Thus, if a slow device (such as a USB disk) has many dirty pages, the
> processes which write small data to the other disks can be blocked for
> quite a long time.
> 
> -Solution:
> This patch introduces high/low-watermark algorithm in
> balance_dirty_pages() in order to throttle only the processes which
> write to disks with heavy load.
> 
> This patch adds `dirty_start_writeback_ratio' for the low-watermark,
> and modifies get_dirty_limits() to calculate and return the writeback
> starting level of dirty pages based on `dirty_start_writeback_ratio'.
> 
> If % of Dirty+Writeback > `dirty_start_writeback_ratio', generators of
> dirty pages start writeback of dirty pages by themselves. At that time,
> these processes are not blocked in balance_dirty_pages(), but they may
> be blocked if the write-requests-queue of the written disk is full
> (that is, the length of the queue > `nr_requests'). By this behavior,
> we can throttle only processes which write to the disks with heavy load,
> and can allow processes to write to the other disks without blocking.
> 
> If % of Dirty+Writeback > `dirty_ratio', generators of dirty pages
> are throttled as current Linux does, not to fill up memory with dirty
> pages.

Does this actually solve the problem?  If the request queue is sufficiently
large (relative to the various dirty-memory thresholds) then I'd expect
that a heavy-writer will be able to very quickly take the total
dirty+writeback memory up to the dirty_ratio (should be renamed
throttle_threshold, but it's too late for that).

I suspect the reason why this patch was successful in your testing was
because dirty_start_writeback_ratio happens to exceed the size of the disk
request queues, so the heavy writer is getting stuck on disk request queue
exhaustion.

But that won't work if we have a lot of processes writing to a lot of
disks, and it won't work if the request queue size is large, or if the
dirty-memory thresholds are small (relative to the request queue size).

Do the patches still work after
`echo 10000 > /sys/block/sda/queue/nr_requests'?


* Re: [PATCH 1/2] VM throttling: Start writeback at dirty_writeback_start_ratio
  2007-04-06  0:31               ` Andrew Morton
@ 2007-04-10  3:04                 ` Tomoki Sekiyama
  2007-04-10  3:46                   ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Tomoki Sekiyama @ 2007-04-10  3:04 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Bill Davidsen, yumiko.sugita.yf, masami.hiramatsu.pt,
	hidehiro.kawai.ez, yuji.kakutani.uw, soshima, haoki,
	Peter Zijlstra

Hello Andrew,
Thank you for your comments.

Andrew Morton wrote:
> On Tue, 03 Apr 2007 19:46:04 +0900
> Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com> wrote:
>> If % of Dirty+Writeback > `dirty_start_writeback_ratio', generators of
>> dirty pages start writeback of dirty pages by themselves. At that time,
>> these processes are not blocked in balance_dirty_pages(), but they may
>> be blocked if the write-requests-queue of the written disk is full
>> (that is, the length of the queue > `nr_requests'). By this behavior,
>> we can throttle only processes which write to the disks with heavy load,
>> and can allow processes to write to the other disks without blocking.
>>
>> If % of Dirty+Writeback > `dirty_ratio', generators of dirty pages
>> are throttled as current Linux does, not to fill up memory with dirty
>> pages.
> 
> Does this actually solve the problem?  If the request queue is sufficiently
> large (relative to the various dirty-memory thresholds) then I'd expect
> that a heavy-writer will be able to very quickly take the total
> dirty+writeback memory up to the dirty_ratio (should be renamed
> throttle_threshold, but it's too late for that).
> 
> I suspect the reason why this patch was successful in your testing was
> because dirty_start_writeback_ratio happens to exceed the size of the disk
> request queues, so the heavy writer is getting stuck on disk request queue
> exhaustion.
> 
> But that won't work if we have a lot of processes writing to a lot of
> disks, and it won't work if the request queue size is large, or if the
> dirty-memory thresholds are small (relative to the request queue size).
> 
> Do the patches still work after
> `echo 10000 > /sys/block/sda/queue/nr_requests'?

As you pointed out, this patch has no effect if nr_requests is too large,
because it distinguishes heavy disks depending on the length of the write-
requests queue of each disk.

This patch is for providing the system administrators with room to avoid
the problem by adjusting parameters appropriately, rather than an automatic
solution for every possible situation.

Could you please tell me some situations in which we should set nr_requests
that large?

Thanks,
-- 
Tomoki Sekiyama
Hitachi, Ltd., Systems Development Laboratory


* Re: [PATCH 1/2] VM throttling: Start writeback at dirty_writeback_start_ratio
  2007-04-10  3:04                 ` Tomoki Sekiyama
@ 2007-04-10  3:46                   ` Andrew Morton
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2007-04-10  3:46 UTC (permalink / raw)
  To: Tomoki Sekiyama
  Cc: linux-kernel, Bill Davidsen, yumiko.sugita.yf,
	masami.hiramatsu.pt, hidehiro.kawai.ez, yuji.kakutani.uw,
	soshima, haoki, Peter Zijlstra

On Tue, 10 Apr 2007 12:04:54 +0900 Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com> wrote:

> Hello Andrew,
> Thank you for your comments.
> 
> Andrew Morton wrote:
> > On Tue, 03 Apr 2007 19:46:04 +0900
> > Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com> wrote:
> >> If % of Dirty+Writeback > `dirty_start_writeback_ratio', generators of
> >> dirty pages start writeback of dirty pages by themselves. At that time,
> >> these processes are not blocked in balance_dirty_pages(), but they may
> >> be blocked if the write-requests-queue of the written disk is full
> >> (that is, the length of the queue > `nr_requests'). By this behavior,
> >> we can throttle only processes which write to the disks with heavy load,
> >> and can allow processes to write to the other disks without blocking.
> >>
> >> If % of Dirty+Writeback > `dirty_ratio', generators of dirty pages
> >> are throttled as current Linux does, not to fill up memory with dirty
> >> pages.
> > 
> > Does this actually solve the problem?  If the request queue is sufficiently
> > large (relative to the various dirty-memory thresholds) then I'd expect
> > that a heavy-writer will be able to very quickly take the total
> > dirty+writeback memory up to the dirty_ratio (should be renamed
> > throttle_threshold, but it's too late for that).
> > 
> > I suspect the reason why this patch was successful in your testing was
> > because dirty_start_writeback_ratio happens to exceed the size of the disk
> > request queues, so the heavy writer is getting stuck on disk request queue
> > exhaustion.
> > 
> > But that won't work if we have a lot of processes writing to a lot of
> > disks, and it won't work if the request queue size is large, or if the
> > dirty-memory thresholds are small (relative to the request queue size).
> > 
> > Do the patches still work after
> > `echo 10000 > /sys/block/sda/queue/nr_requests'?
> 
> As you pointed out, this patch has no effect if nr_requests is too large,
> because it distinguishes heavy disks depending on the length of the write-
> requests queue of each disk.
> 
> This patch is for providing the system administrators with room to avoid
> the problem by adjusting parameters appropriately, rather than an automatic
> solution for any possible situations.
> 
> Could you please tell me some situations in which we should set nr_request
> that large?

It's probably not a sensible thing to do.  But it's _possible_ to do, and
the fact that the kernel will again misbehave indicates an overall weakness
in our design.

And there are other ways in which this situation could occur:

- The request queue has a fixed size (it is not scaled according to the
  amount of memory in the machine).  So if the machine is small enough
  (say, 64MB) then the problem can happen.

- The machine could have a large number of disks

- The queue size of 128 is in units of "number of requests".  But it is
  independent of the _size_ of those requests.  If someone comes up with
  a driver which wants to use 16MB-sized requests, the problem will
  reoccur.

For all these sorts of reasons, we have learned that we should avoid any
dependence upon request queue exhaustion within the VM/VFS/etc.
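
To put rough numbers on the cases above, here is a small userspace sketch.
It is only an approximation: the function names are hypothetical, it treats
all of memory as dirtyable, and it assumes every request in a queue has the
same size. The idea is that if the request queues can hold as much data as
the dirty limit allows, total dirty+writeback memory can reach dirty_ratio
before any writer blocks on a full queue.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes that the request queues of nr_disks disks can hold when full. */
uint64_t queue_capacity_bytes(unsigned int nr_disks, uint64_t nr_requests,
			      uint64_t request_bytes)
{
	return (uint64_t)nr_disks * nr_requests * request_bytes;
}

/* Bytes of dirty+writeback memory permitted at ratio_percent. */
uint64_t threshold_bytes(uint64_t dirtyable_bytes, unsigned int ratio_percent)
{
	return dirtyable_bytes * ratio_percent / 100;
}

/*
 * True when the queues can absorb the whole dirty limit, i.e. when
 * relying on request queue exhaustion cannot throttle the heavy writer
 * before the blocking level is reached.
 */
bool queues_can_hold_dirty_limit(unsigned int nr_disks, uint64_t nr_requests,
				 uint64_t request_bytes,
				 uint64_t dirtyable_bytes,
				 unsigned int dirty_ratio)
{
	return queue_capacity_bytes(nr_disks, nr_requests, request_bytes) >=
	       threshold_bytes(dirtyable_bytes, dirty_ratio);
}
```

Assuming 512KB requests (an illustrative figure, not taken from the thread),
one queue of 128 requests holds 64MB. That is far below 40% of a 4GB machine
but above 40% of a 64MB machine, and the same happens on the 4GB machine
with 16MB requests, with nr_requests set to 10000, or with 32 disks.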



Thread overview: 14+ messages
2007-03-14 12:42 [PATCH 0/3] VM throttling: avoid blocking occasional writers Tomoki Sekiyama
2007-03-14 13:18 ` Peter Zijlstra
2007-03-15 19:07 ` Andrew Morton
2007-03-18 14:59   ` Bill Davidsen
2007-03-22  5:49     ` Tomoki Sekiyama
2007-03-22 11:41       ` Bill Davidsen
2007-03-26 10:27         ` Tomoki Sekiyama
2007-03-26 17:11           ` Bill Davidsen
2007-04-03 10:42             ` Tomoki Sekiyama
2007-04-03 10:46             ` [PATCH 1/2] VM throttling: Start writeback at dirty_writeback_start_ratio Tomoki Sekiyama
2007-04-06  0:31               ` Andrew Morton
2007-04-10  3:04                 ` Tomoki Sekiyama
2007-04-10  3:46                   ` Andrew Morton
2007-04-03 10:47             ` [PATCH 2/2] VM throttling: Add vm.dirty_start_writeback_ratio to sysctl Tomoki Sekiyama
