* [RFC][PATCH 0/6] per device dirty throttling
@ 2007-03-19 15:57 ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-19 15:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra

This patch-set implements per device dirty page throttling, which should solve
the problem we currently have with one device hogging the dirty limit.

Preliminary testing shows good results:

mem=128M

time (dd if=/dev/zero of=/mnt/<dev>/zero bs=4096 count=$((1024*1024/4)); sync)

Each test was run three times; the columns below show the individual runs.

1GB to disk

real    0m33.074s       0m34.596s       0m33.387s
user    0m0.147s        0m0.163s        0m0.142s
sys     0m7.872s        0m8.409s        0m8.395s

1GB to usb-flash

real    3m21.170s       3m15.512s       3m23.889s
user    0m0.135s        0m0.146s        0m0.127s
sys     0m7.327s        0m7.328s        0m7.342s


2.6.20 device vs device

1GB disk vs disk

real    1m30.736s       1m16.133s       1m42.068s
user    0m0.204s        0m0.167s        0m0.222s
sys     0m10.438s       0m7.958s        0m10.599s

1GB usb-flash vs background disk

N/A (not finished after 30+ minutes)

1GB disk vs background usb-flash

real    4m0.687s        2m20.145s       4m12.923s
user    0m0.173s        0m0.185s        0m0.161s
sys     0m8.227s        0m8.581s        0m8.345s


2.6.20-writeback

1GB disk vs disk

real    0m36.696s       0m40.837s       0m38.679s
user    0m0.161s        0m0.148s        0m0.160s
sys     0m8.240s        0m8.068s        0m8.174s

1GB usb-flash vs background disk

real    3m37.464s       3m49.720s       4m5.805s
user    0m0.167s        0m0.166s        0m0.149s
sys     0m7.195s        0m7.281s        0m7.199s

1GB disk vs background usb-flash

real    0m41.585s       0m30.888s       0m34.493s
user    0m0.161s        0m0.167s        0m0.162s
sys     0m7.826s        0m7.807s        0m7.821s



* [RFC][PATCH 1/6] mm: scalable bdi statistics counters.
  2007-03-19 15:57 ` Peter Zijlstra
@ 2007-03-19 15:57   ` Peter Zijlstra
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-19 15:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra

[-- Attachment #1: bdi_stat.patch --]
[-- Type: text/plain, Size: 7681 bytes --]

Provide scalable per backing_dev_info statistics counters modeled on the ZVC
(zoned VM counters) code.
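
For illustration, a minimal sketch of how callers use these counters. Note
that this patch adds no enum items yet, so BDI_DIRTY below is hypothetical
at this point; it is only introduced in patch 2/6:

	/* caller already has interrupts disabled, e.g. under tree_lock */
	__inc_bdi_stat(bdi, BDI_DIRTY);

	/* irq-safe variant for general contexts */
	dec_bdi_stat(bdi, BDI_DIRTY);

	/* approximate read; can lag by up to stat_threshold per CPU */
	unsigned long nr_dirty = bdi_stat(bdi, BDI_DIRTY);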

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
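The per-CPU deltas only touch the shared atomics once they cross
stat_threshold, and __inc_bdi_stat()/__dec_bdi_stat() flush an extra
threshold/2 (the "overstep") so that a counter moving steadily in one
direction flushes roughly every 1.5 * threshold events instead of every
threshold events. A standalone userspace sketch of that behaviour (names
are illustrative, this is not the kernel code):

#include <stdio.h>

static long global_count;	/* stands in for the shared atomic */
static signed char cpu_diff;	/* per-CPU delta (s8 in the patch) */
static const int threshold = 8;

static void inc(void)
{
	if (++cpu_diff > threshold) {
		int overstep = threshold / 2;

		global_count += cpu_diff + overstep;	/* flush high */
		cpu_diff = -overstep;			/* compensate */
	}
}

int main(void)
{
	int i;

	for (i = 0; i < 100; i++)
		inc();
	/* global_count + cpu_diff always equals the true count (100) */
	printf("global=%ld unflushed=%d\n", global_count, cpu_diff);
	return 0;
}
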
 block/ll_rw_blk.c           |    1 
 drivers/block/rd.c          |    2 
 drivers/char/mem.c          |    2 
 fs/char_dev.c               |    1 
 include/linux/backing-dev.h |   86 ++++++++++++++++++++++++++++++++++++
 mm/backing-dev.c            |  103 ++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 195 insertions(+)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -211,6 +211,7 @@ void blk_queue_make_request(request_queu
 	q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 	q->backing_dev_info.state = 0;
 	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
+	bdi_stat_init(&q->backing_dev_info);
 	blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
 	blk_queue_hardsect_size(q, 512);
 	blk_queue_dma_alignment(q, 511);
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -22,6 +22,17 @@ enum bdi_state {
 	BDI_unused,		/* Available bits start here */
 };
 
+enum bdi_stat_item {
+	NR_BDI_STAT_ITEMS
+};
+
+#ifdef CONFIG_SMP
+struct bdi_per_cpu_data {
+	s8 stat_threshold;
+	s8 bdi_stat_diff[NR_BDI_STAT_ITEMS];
+} ____cacheline_aligned_in_smp;
+#endif
+
 typedef int (congested_fn)(void *, int);
 
 struct backing_dev_info {
@@ -32,8 +43,83 @@ struct backing_dev_info {
 	void *congested_data;	/* Pointer to aux data for congested func */
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
+
+	atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
+#ifdef CONFIG_SMP
+	struct bdi_per_cpu_data pcd[NR_CPUS];
+#endif
 };
 
+extern atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
+
+static inline void bdi_stat_add(long x, struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	atomic_long_add(x, &bdi->bdi_stats[item]);
+	atomic_long_add(x, &bdi_stats[item]);
+}
+
+/*
+ * Counters can transiently go negative; read as signed and clip at 0.
+ */
+static inline unsigned long global_bdi_stat(enum bdi_stat_item item)
+{
+	long x = atomic_long_read(&bdi_stats[item]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
+static inline unsigned long bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	long x = atomic_long_read(&bdi->bdi_stats[item]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
+#ifdef CONFIG_SMP
+void __mod_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item, int delta);
+void __inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+void __dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+
+void mod_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item, int delta);
+void inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+void dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+
+#else /* CONFIG_SMP */
+
+static inline void __mod_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, int delta)
+{
+	bdi_stat_add(delta, bdi, item);
+}
+
+static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	atomic_long_inc(&bdi->bdi_stats[item]);
+	atomic_long_inc(&bdi_stats[item]);
+}
+
+static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	atomic_long_dec(&bdi->bdi_stats[item]);
+	atomic_long_dec(&bdi_stats[item]);
+}
+
+#define mod_bdi_stat __mod_bdi_stat
+#define inc_bdi_stat __inc_bdi_stat
+#define dec_bdi_stat __dec_bdi_stat
+#endif
+
+void bdi_stat_init(struct backing_dev_info *bdi);
 
 /*
  * Flags in backing_dev_info::capability
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -67,3 +67,106 @@ void congestion_end(int rw)
 		wake_up(wqh);
 }
 EXPORT_SYMBOL(congestion_end);
+
+atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
+EXPORT_SYMBOL(bdi_stats);
+
+void bdi_stat_init(struct backing_dev_info *bdi)
+{
+	int i;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+		atomic_long_set(&bdi->bdi_stats[i], 0);
+
+#ifdef CONFIG_SMP
+	for (i = 0; i < NR_CPUS; i++) {
+		int j;
+		for (j = 0; j < NR_BDI_STAT_ITEMS; j++)
+			bdi->pcd[i].bdi_stat_diff[j] = 0;
+		bdi->pcd[i].stat_threshold = 8 * ilog2(num_online_cpus());
+	}
+#endif
+}
+EXPORT_SYMBOL(bdi_stat_init);
+
+#ifdef CONFIG_SMP
+void __mod_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, int delta)
+{
+	struct bdi_per_cpu_data *pcd = &bdi->pcd[smp_processor_id()];
+	s8 *p = pcd->bdi_stat_diff + item;
+	long x;
+
+	x = delta + *p;
+
+	if (unlikely(x > pcd->stat_threshold || x < -pcd->stat_threshold)) {
+		bdi_stat_add(x, bdi, item);
+		x = 0;
+	}
+	*p = x;
+}
+EXPORT_SYMBOL(__mod_bdi_stat);
+
+void mod_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, int delta)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_bdi_stat(bdi, item, delta);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(mod_bdi_stat);
+
+void __inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item)
+{
+	struct bdi_per_cpu_data *pcd = &bdi->pcd[smp_processor_id()];
+	s8 *p = pcd->bdi_stat_diff + item;
+
+	(*p)++;
+
+	if (unlikely(*p > pcd->stat_threshold)) {
+		int overstep = pcd->stat_threshold / 2;
+
+		bdi_stat_add(*p + overstep, bdi, item);
+		*p = -overstep;
+	}
+}
+EXPORT_SYMBOL(__inc_bdi_stat);
+
+void inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__inc_bdi_stat(bdi, item);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(inc_bdi_stat);
+
+void __dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item)
+{
+	struct bdi_per_cpu_data *pcd = &bdi->pcd[smp_processor_id()];
+	s8 *p = pcd->bdi_stat_diff + item;
+
+	(*p)--;
+
+	if (unlikely(*p < -pcd->stat_threshold)) {
+		int overstep = pcd->stat_threshold / 2;
+
+		bdi_stat_add(*p - overstep, bdi, item);
+		*p = overstep;
+	}
+}
+EXPORT_SYMBOL(__dec_bdi_stat);
+
+void dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__dec_bdi_stat(bdi, item);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(dec_bdi_stat);
+#endif
Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -421,6 +421,8 @@ static int __init rd_init(void)
 	int i;
 	int err = -ENOMEM;
 
+	bdi_stat_init(&rd_file_backing_dev_info);
+
 	if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
 			(rd_blocksize & (rd_blocksize-1))) {
 		printk("RAMDISK: wrong blocksize %d, reverting to defaults\n",
Index: linux-2.6/drivers/char/mem.c
===================================================================
--- linux-2.6.orig/drivers/char/mem.c
+++ linux-2.6/drivers/char/mem.c
@@ -988,6 +988,8 @@ static int __init chr_dev_init(void)
 			      MKDEV(MEM_MAJOR, devlist[i].minor),
 			      devlist[i].name);
 
+	bdi_stat_init(&zero_bdi);
+
 	return 0;
 }
 
Index: linux-2.6/fs/char_dev.c
===================================================================
--- linux-2.6.orig/fs/char_dev.c
+++ linux-2.6/fs/char_dev.c
@@ -545,6 +545,7 @@ static struct kobject *base_probe(dev_t 
 void __init chrdev_init(void)
 {
 	cdev_map = kobj_map_init(base_probe, &chrdevs_lock);
+	bdi_stat_init(&directly_mappable_cdev_bdi);
 }
 
 

--


* [RFC][PATCH 2/6] mm: count dirty pages per BDI
  2007-03-19 15:57 ` Peter Zijlstra
@ 2007-03-19 15:57   ` Peter Zijlstra
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-19 15:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra

[-- Attachment #1: bdi_stat_dirty.patch --]
[-- Type: text/plain, Size: 2374 bytes --]

Count per BDI dirty pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/buffer.c                 |    1 +
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |    2 ++
 mm/truncate.c               |    1 +
 4 files changed, 5 insertions(+)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -732,6 +732,7 @@ int __set_page_dirty_buffers(struct page
 	if (page->mapping) {	/* Race with truncate? */
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
+			__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(&mapping->page_tree,
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -766,6 +766,7 @@ int __set_page_dirty_nobuffers(struct pa
 			BUG_ON(mapping2 != mapping);
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
+				__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
 			radix_tree_tag_set(&mapping->page_tree,
@@ -892,6 +893,7 @@ int clear_page_dirty_for_io(struct page 
 			set_page_dirty(page);
 		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 			return 1;
 		}
 		return 0;
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -71,6 +71,7 @@ void cancel_dirty_page(struct page *page
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 			if (account_size)
 				task_io_account_cancelled_write(account_size);
 		}
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -23,6 +23,7 @@ enum bdi_state {
 };
 
 enum bdi_stat_item {
+	BDI_DIRTY,
 	NR_BDI_STAT_ITEMS
 };
 

--


* [RFC][PATCH 3/6] mm: count writeback pages per BDI
  2007-03-19 15:57 ` Peter Zijlstra
@ 2007-03-19 15:57   ` Peter Zijlstra
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-19 15:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra

[-- Attachment #1: bdi_stat_writeback.patch --]
[-- Type: text/plain, Size: 1629 bytes --]

Count per BDI writeback pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |    8 ++++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -912,10 +912,12 @@ int test_clear_page_writeback(struct pag
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
-		if (ret)
+		if (ret) {
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			__dec_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
@@ -933,10 +935,12 @@ int test_set_page_writeback(struct page 
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
-		if (!ret)
+		if (!ret) {
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			__inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -24,6 +24,7 @@ enum bdi_state {
 
 enum bdi_stat_item {
 	BDI_DIRTY,
+	BDI_WRITEBACK,
 	NR_BDI_STAT_ITEMS
 };
 

--


* [RFC][PATCH 4/6] mm: count unstable pages per BDI
  2007-03-19 15:57 ` Peter Zijlstra
@ 2007-03-19 15:57   ` Peter Zijlstra
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-19 15:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra

[-- Attachment #1: bdi_stat_unstable.patch --]
[-- Type: text/plain, Size: 2019 bytes --]

Count per BDI unstable pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/write.c              |    4 ++++
 include/linux/backing-dev.h |    1 +
 2 files changed, 5 insertions(+)

Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -474,6 +474,7 @@ nfs_mark_request_commit(struct nfs_page 
 	nfsi->ncommit++;
 	spin_unlock(&nfsi->req_lock);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 #endif
@@ -545,6 +546,7 @@ static void nfs_cancel_commit_list(struc
 	while(!list_empty(head)) {
 		req = nfs_list_entry(head->next);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
 		nfs_list_remove_request(req);
 		nfs_inode_remove_request(req);
 		nfs_unlock_request(req);
@@ -1278,6 +1280,7 @@ nfs_commit_list(struct inode *inode, str
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
 		nfs_clear_page_writeback(req);
 	}
 	return -ENOMEM;
@@ -1302,6 +1305,7 @@ static void nfs_commit_done(struct rpc_t
 		req = nfs_list_entry(data->pages.next);
 		nfs_list_remove_request(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
 
 		dprintk("NFS: commit (%s/%Ld %d@%Ld)",
 			req->wb_context->dentry->d_inode->i_sb->s_id,
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -25,6 +25,7 @@ enum bdi_state {
 enum bdi_stat_item {
 	BDI_DIRTY,
 	BDI_WRITEBACK,
+	BDI_UNSTABLE,
 	NR_BDI_STAT_ITEMS
 };
 

--


* [RFC][PATCH 5/6] mm: per device dirty threshold
  2007-03-19 15:57 ` Peter Zijlstra
@ 2007-03-19 15:57   ` Peter Zijlstra
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-19 15:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra

[-- Attachment #1: writeback-balance-per-backing_dev.patch --]
[-- Type: text/plain, Size: 10510 bytes --]

Scale writeback cache per backing device, proportional to its writeout speed.

akpm sayeth:
> Which problem are we trying to solve here?  afaik our two uppermost
> problems are:
> 
> a) Heavy write to queue A causes light writer to queue B to block for a long
> time in balance_dirty_pages().  Even if the devices have the same speed.

This one; especially when the speeds differ: the "my usb stick makes my
computer suck" problem. But even at similar speeds, the per-device
separation should avoid blocking dev B while dev A is being throttled.

The writeout speed is measured dynamically, so when a device has had
nothing to write out for a while, its writeback cache share drops to 0.

Conversely, a device that is just starting up initially behaves almost
synchronously, but quickly builds up a 'fair' share of the writeback
cache.

> b) heavy write to device A causes light write to device A to block for a
> long time in balance_dirty_pages(), occasionally.  Harder to fix.

This will indeed take more work. I have thought about it, but one quickly
ends up needing per-task state.


How it all works:

We pick a 2^n value, based on the total VM size, to act as a period:
vm_cycle_shift. This period measures 'time' in writeout events.

Each writeout advances time and adds to a per-BDI counter. This counter is
halved when a period expires, so the per-BDI speed is:

  0.5 * (previous cycle speed) + this cycle's events.
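
To make this concrete, here is a standalone userspace sketch of the decay
(deliberately simplified: one fixed period, halved exactly at period end;
the names are illustrative and this is not the kernel code):

#include <stdio.h>

#define NDEV 2
#define PERIOD 16			/* writeout events per decay period */

static unsigned long dev_events[NDEV];	/* per-device floating averages */
static unsigned long total_events;	/* global writeout counter */

static void writeout(int dev)
{
	dev_events[dev]++;
	if (++total_events % PERIOD == 0) {	/* a period expired */
		int i;

		for (i = 0; i < NDEV; i++)	/* 0.5 * previous + new */
			dev_events[i] /= 2;
	}
}

int main(void)
{
	int i;

	/* device 0 writes out three pages for every one of device 1 */
	for (i = 0; i < 640; i++)
		writeout(i % 4 == 3);

	/* counts settle near 3:1, mirroring the relative writeout speeds */
	printf("dev0=%lu dev1=%lu\n", dev_events[0], dev_events[1]);
	return 0;
}

Given such an estimate, get_dirty_limits() below hands each device roughly
its writeout share of ~95% of the global dirty limit: (tmp * 122) >> 7 is
122/128 ~= 0.95, so a device doing about a quarter of the recent writeouts
ends up with a bdi_thresh of roughly 24% of dirty_thresh.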

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |    8 ++
 mm/backing-dev.c            |    3 
 mm/page-writeback.c         |  145 ++++++++++++++++++++++++++++++++++----------
 3 files changed, 125 insertions(+), 31 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -26,6 +26,8 @@ enum bdi_stat_item {
 	BDI_DIRTY,
 	BDI_WRITEBACK,
 	BDI_UNSTABLE,
+	BDI_WRITEOUT,
+	BDI_WRITEOUT_TOTAL,
 	NR_BDI_STAT_ITEMS
 };
 
@@ -47,6 +49,12 @@ struct backing_dev_info {
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
 
+	/*
+	 * data used for scaling the writeback cache
+	 */
+	spinlock_t lock;	/* protect the cycle count */
+	unsigned long cycles;	/* writeout cycles */
+
 	atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
 #ifdef CONFIG_SMP
 	struct bdi_per_cpu_data pcd[NR_CPUS];
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -49,8 +49,6 @@
  */
 static long ratelimit_pages = 32;
 
-static int dirty_exceeded __cacheline_aligned_in_smp;	/* Dirty mem may be over limit */
-
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +101,77 @@ EXPORT_SYMBOL(laptop_mode);
 static void background_writeout(unsigned long _min_pages);
 
 /*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by tracking a floating average per BDI and a global floating
+ * average. We optimize away the '/= 2' for the global average by noting that:
+ *
+ *  if (++i > thresh) i /= 2;
+ *
+ * Can be approximated by:
+ *
+ *   thresh/2 + (++i % thresh/2)
+ *
+ * Furthermore, when we choose thresh to be 2^n it can be written in terms of
+ * binary operations and wraparound artifacts disappear.
+ */
+static int vm_cycle_shift __read_mostly;
+
+/*
+ * Sync up the per BDI average to the global cycle.
+ *
+ * NOTE: we mask out the MSB of the cycle count because bdi_stats really are
+ * not unsigned long. (see comment in backing-dev.h)
+ */
+static void bdi_writeout_norm(struct backing_dev_info *bdi)
+{
+	int bits = vm_cycle_shift;
+	unsigned long cycle = 1UL << bits;
+	unsigned long mask = ~(cycle - 1) | (1UL << (BITS_PER_LONG - 1));
+	unsigned long total = global_bdi_stat(BDI_WRITEOUT_TOTAL) << 1;
+	unsigned long flags;
+
+	if ((bdi->cycles & mask) == (total & mask))
+		return;
+
+	spin_lock_irqsave(&bdi->lock, flags);
+	while ((bdi->cycles & mask) != (total & mask)) {
+		unsigned long half = bdi_stat(bdi, BDI_WRITEOUT) / 2;
+
+		mod_bdi_stat(bdi, BDI_WRITEOUT, -half);
+		bdi->cycles += cycle;
+	}
+	spin_unlock_irqrestore(&bdi->lock, flags);
+}
+
+static void bdi_writeout_inc(struct backing_dev_info *bdi)
+{
+	if (!bdi_cap_writeback_dirty(bdi))
+		return;
+
+	__inc_bdi_stat(bdi, BDI_WRITEOUT);
+	__inc_bdi_stat(bdi, BDI_WRITEOUT_TOTAL);
+
+	bdi_writeout_norm(bdi);
+}
+
+void get_writeout_scale(struct backing_dev_info *bdi, int *scale, int *div)
+{
+	int bits = vm_cycle_shift - 1;
+	unsigned long total = global_bdi_stat(BDI_WRITEOUT_TOTAL);
+	unsigned long cycle = 1UL << bits;
+	unsigned long mask = cycle - 1;
+
+	if (bdi_cap_writeback_dirty(bdi)) {
+		bdi_writeout_norm(bdi);
+		*scale = bdi_stat(bdi, BDI_WRITEOUT);
+	} else
+		*scale = 0;
+
+	*div = cycle + (total & mask);
+}
+
+/*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
  *
@@ -120,7 +189,7 @@ static void background_writeout(unsigned
  * clamping level.
  */
 static void
-get_dirty_limits(long *pbackground, long *pdirty,
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
 					struct address_space *mapping)
 {
 	int background_ratio;		/* Percentages */
@@ -163,6 +232,22 @@ get_dirty_limits(long *pbackground, long
 	}
 	*pbackground = background;
 	*pdirty = dirty;
+
+	if (mapping) {
+		long long tmp = dirty;
+		int scale, div;
+
+		get_writeout_scale(mapping->backing_dev_info, &scale, &div);
+
+		if (scale > div)
+			scale = div;
+
+		tmp = (tmp * 122) >> 7; /* take ~95% of total dirty value */
+		tmp *= scale;
+		do_div(tmp, div);
+
+		*pbdi_dirty = (long)tmp;
+	}
 }
 
 /*
@@ -174,9 +259,10 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-	long nr_reclaimable;
+	long bdi_nr_reclaimable;
 	long background_thresh;
 	long dirty_thresh;
+	long bdi_thresh;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
 
@@ -191,32 +277,31 @@ static void balance_dirty_pages(struct a
 			.range_cyclic	= 1,
 		};
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
-			dirty_thresh)
+		get_dirty_limits(&background_thresh, &dirty_thresh,
+				&bdi_thresh, mapping);
+		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
+					bdi_stat(bdi, BDI_UNSTABLE);
+		if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
+		     	bdi_thresh)
 				break;
 
-		if (!dirty_exceeded)
-			dirty_exceeded = 1;
-
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		if (nr_reclaimable) {
+		if (bdi_nr_reclaimable) {
 			writeback_inodes(&wbc);
-			get_dirty_limits(&background_thresh,
-					 	&dirty_thresh, mapping);
-			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-			if (nr_reclaimable +
-				global_page_state(NR_WRITEBACK)
-					<= dirty_thresh)
-						break;
+
+			get_dirty_limits(&background_thresh, &dirty_thresh,
+				       &bdi_thresh, mapping);
+			bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
+						bdi_stat(bdi, BDI_UNSTABLE);
+			if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
+			     	bdi_thresh)
+				break;
+
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
@@ -224,10 +309,6 @@ static void balance_dirty_pages(struct a
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	if (nr_reclaimable + global_page_state(NR_WRITEBACK)
-		<= dirty_thresh && dirty_exceeded)
-			dirty_exceeded = 0;
-
 	if (writeback_in_progress(bdi))
 		return;		/* pdflush is already working this queue */
 
@@ -240,7 +321,9 @@ static void balance_dirty_pages(struct a
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-	     (!laptop_mode && (nr_reclaimable > background_thresh)))
+			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
+					  + global_page_state(NR_UNSTABLE_NFS)
+					  > background_thresh)))
 		pdflush_operation(background_writeout, 0);
 }
 
@@ -275,9 +358,7 @@ void balance_dirty_pages_ratelimited_nr(
 	unsigned long ratelimit;
 	unsigned long *p;
 
-	ratelimit = ratelimit_pages;
-	if (dirty_exceeded)
-		ratelimit = 8;
+	ratelimit = 8;
 
 	/*
 	 * Check the rate limiting. Also, we do not want to throttle real-time
@@ -302,7 +383,7 @@ void throttle_vm_writeout(void)
 	long dirty_thresh;
 
         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
@@ -338,7 +419,7 @@ static void background_writeout(unsigned
 		long background_thresh;
 		long dirty_thresh;
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 		if (global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) < background_thresh
 				&& min_pages <= 0)
@@ -546,6 +627,7 @@ void __init page_writeback_init(void)
 	mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
 	writeback_set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
+	vm_cycle_shift = 3 + ilog2(int_sqrt(vm_total_pages));
 }
 
 /**
@@ -917,6 +999,7 @@ int test_clear_page_writeback(struct pag
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			__dec_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+			bdi_writeout_inc(mapping->backing_dev_info);
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -75,6 +75,9 @@ void bdi_stat_init(struct backing_dev_in
 {
 	int i;
 
+	spin_lock_init(&bdi->lock);
+	bdi->cycles = 0;
+
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
 		atomic_long_set(&bdi->bdi_stats[i], 0);
 

--


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC][PATCH 5/6] mm: per device dirty threshold
@ 2007-03-19 15:57   ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-19 15:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra

[-- Attachment #1: writeback-balance-per-backing_dev.patch --]
[-- Type: text/plain, Size: 10735 bytes --]

Scale writeback cache per backing device, proportional to its writeout speed.

akpm sayeth:
> Which problem are we trying to solve here?  afaik our two uppermost
> problems are:
> 
> a) Heavy write to queue A causes light writer to queue B to blok for a long
> time in balance_dirty_pages().  Even if the devices have the same speed.  

This one; esp when not the same speed. The - my usb stick makes my
computer suck - problem. But even on similar speed, the separation of
device should avoid blocking dev B when dev A is being throttled.

The writeout speed is measure dynamically, so when it doesn't have
anything to write out for a while its writeback cache size goes to 0.

Conversely, when starting up it will in the beginning act almost
synchronous but will quickly build up a 'fair' share of the writeback
cache.

> b) heavy write to device A causes light write to device A to block for a
> long time in balance_dirty_pages(), occasionally.  Harder to fix.

This will indeed take more. I've thought about it though. But one
quickly ends up with per task state.


How it all works:

We pick a 2^n value based on the total vm size to act as a period -
vm_cycle_shift. This period measures 'time' in writeout events.

Each writeout increases time and adds to a per bdi counter. This counter is 
halved when a period expires. So per bdi speed is:

  0.5 * (previous cycle speed) + this cycle's events.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |    8 ++
 mm/backing-dev.c            |    3 
 mm/page-writeback.c         |  145 ++++++++++++++++++++++++++++++++++----------
 3 files changed, 125 insertions(+), 31 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -26,6 +26,8 @@ enum bdi_stat_item {
 	BDI_DIRTY,
 	BDI_WRITEBACK,
 	BDI_UNSTABLE,
+	BDI_WRITEOUT,
+	BDI_WRITEOUT_TOTAL,
 	NR_BDI_STAT_ITEMS
 };
 
@@ -47,6 +49,12 @@ struct backing_dev_info {
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
 
+	/*
+	 * data used for scaling the writeback cache
+	 */
+	spinlock_t lock;	/* protect the cycle count */
+	unsigned long cycles;	/* writeout cycles */
+
 	atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
 #ifdef CONFIG_SMP
 	struct bdi_per_cpu_data pcd[NR_CPUS];
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -49,8 +49,6 @@
  */
 static long ratelimit_pages = 32;
 
-static int dirty_exceeded __cacheline_aligned_in_smp;	/* Dirty mem may be over limit */
-
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +101,77 @@ EXPORT_SYMBOL(laptop_mode);
 static void background_writeout(unsigned long _min_pages);
 
 /*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by tracking a floating average per BDI and a global floating
+ * average. We optimize away the '/= 2' for the global average by noting that:
+ *
+ *  if (++i > thresh) i /= 2:
+ *
+ * Can be approximated by:
+ *
+ *   thresh/2 + (++i % thresh/2)
+ *
+ * Furthermore, when we choose thresh to be 2^n it can be written in terms of
+ * binary operations and wraparound artifacts disappear.
+ */
+static int vm_cycle_shift __read_mostly;
+
+/*
+ * Sync up the per BDI average to the global cycle.
+ *
+ * NOTE: we mask out the MSB of the cycle count because bdi_stats really are
+ * not unsigned long. (see comment in backing_dev.h)
+ */
+static void bdi_writeout_norm(struct backing_dev_info *bdi)
+{
+	int bits = vm_cycle_shift;
+	unsigned long cycle = 1UL << bits;
+	unsigned long mask = ~(cycle - 1) | (1UL << BITS_PER_LONG-1);
+	unsigned long total = global_bdi_stat(BDI_WRITEOUT_TOTAL) << 1;
+	unsigned long flags;
+
+	if ((bdi->cycles & mask) == (total & mask))
+		return;
+
+	spin_lock_irqsave(&bdi->lock, flags);
+	while ((bdi->cycles & mask) != (total & mask)) {
+		unsigned long half = bdi_stat(bdi, BDI_WRITEOUT) / 2;
+
+		mod_bdi_stat(bdi, BDI_WRITEOUT, -half);
+		bdi->cycles += cycle;
+	}
+	spin_unlock_irqrestore(&bdi->lock, flags);
+}
+
+static void bdi_writeout_inc(struct backing_dev_info *bdi)
+{
+	if (!bdi_cap_writeback_dirty(bdi))
+		return;
+
+	__inc_bdi_stat(bdi, BDI_WRITEOUT);
+	__inc_bdi_stat(bdi, BDI_WRITEOUT_TOTAL);
+
+	bdi_writeout_norm(bdi);
+}
+
+void get_writeout_scale(struct backing_dev_info *bdi, int *scale, int *div)
+{
+	int bits = vm_cycle_shift - 1;
+	unsigned long total = global_bdi_stat(BDI_WRITEOUT_TOTAL);
+	unsigned long cycle = 1UL << bits;
+	unsigned long mask = cycle - 1;
+
+	if (bdi_cap_writeback_dirty(bdi)) {
+		bdi_writeout_norm(bdi);
+		*scale = bdi_stat(bdi, BDI_WRITEOUT);
+	} else
+		*scale = 0;
+
+	*div = cycle + (total & mask);
+}
+
+/*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
  *
@@ -120,7 +189,7 @@ static void background_writeout(unsigned
  * clamping level.
  */
 static void
-get_dirty_limits(long *pbackground, long *pdirty,
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
 					struct address_space *mapping)
 {
 	int background_ratio;		/* Percentages */
@@ -163,6 +232,22 @@ get_dirty_limits(long *pbackground, long
 	}
 	*pbackground = background;
 	*pdirty = dirty;
+
+	if (mapping) {
+		long long tmp = dirty;
+		int scale, div;
+
+		get_writeout_scale(mapping->backing_dev_info, &scale, &div);
+
+		if (scale > div)
+			scale = div;
+
+		tmp = (tmp * 122) >> 7; /* take ~95% of total dirty value */
+		tmp *= scale;
+		do_div(tmp, div);
+
+		*pbdi_dirty = (long)tmp;
+	}
 }
 
 /*
@@ -174,9 +259,10 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-	long nr_reclaimable;
+	long bdi_nr_reclaimable;
 	long background_thresh;
 	long dirty_thresh;
+	long bdi_thresh;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
 
@@ -191,32 +277,31 @@ static void balance_dirty_pages(struct a
 			.range_cyclic	= 1,
 		};
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
-			dirty_thresh)
+		get_dirty_limits(&background_thresh, &dirty_thresh,
+				&bdi_thresh, mapping);
+		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
+					bdi_stat(bdi, BDI_UNSTABLE);
+		if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
+		     	bdi_thresh)
 				break;
 
-		if (!dirty_exceeded)
-			dirty_exceeded = 1;
-
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		if (nr_reclaimable) {
+		if (bdi_nr_reclaimable) {
 			writeback_inodes(&wbc);
-			get_dirty_limits(&background_thresh,
-					 	&dirty_thresh, mapping);
-			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-			if (nr_reclaimable +
-				global_page_state(NR_WRITEBACK)
-					<= dirty_thresh)
-						break;
+
+			get_dirty_limits(&background_thresh, &dirty_thresh,
+				       &bdi_thresh, mapping);
+			bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
+						bdi_stat(bdi, BDI_UNSTABLE);
+			if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
+			     	bdi_thresh)
+				break;
+
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
@@ -224,10 +309,6 @@ static void balance_dirty_pages(struct a
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	if (nr_reclaimable + global_page_state(NR_WRITEBACK)
-		<= dirty_thresh && dirty_exceeded)
-			dirty_exceeded = 0;
-
 	if (writeback_in_progress(bdi))
 		return;		/* pdflush is already working this queue */
 
@@ -240,7 +321,9 @@ static void balance_dirty_pages(struct a
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-	     (!laptop_mode && (nr_reclaimable > background_thresh)))
+			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
+					  + global_page_state(NR_UNSTABLE_NFS)
+					  > background_thresh)))
 		pdflush_operation(background_writeout, 0);
 }
 
@@ -275,9 +358,7 @@ void balance_dirty_pages_ratelimited_nr(
 	unsigned long ratelimit;
 	unsigned long *p;
 
-	ratelimit = ratelimit_pages;
-	if (dirty_exceeded)
-		ratelimit = 8;
+	ratelimit = 8;
 
 	/*
 	 * Check the rate limiting. Also, we do not want to throttle real-time
@@ -302,7 +383,7 @@ void throttle_vm_writeout(void)
 	long dirty_thresh;
 
         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
@@ -338,7 +419,7 @@ static void background_writeout(unsigned
 		long background_thresh;
 		long dirty_thresh;
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 		if (global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) < background_thresh
 				&& min_pages <= 0)
@@ -546,6 +627,7 @@ void __init page_writeback_init(void)
 	mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
 	writeback_set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
+	vm_cycle_shift = 3 + ilog2(int_sqrt(vm_total_pages));
 }
 
 /**
@@ -917,6 +999,7 @@ int test_clear_page_writeback(struct pag
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			__dec_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+			bdi_writeout_inc(mapping->backing_dev_info);
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -75,6 +75,9 @@ void bdi_stat_init(struct backing_dev_in
 {
 	int i;
 
+	spin_lock_init(&bdi->lock);
+	bdi->cycles = 0;
+
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
 		atomic_long_set(&bdi->bdi_stats[i], 0);
 

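In outline, the rewritten throttling loop above works against the device's
own counters and threshold instead of the global ones; a condensed pseudo-C
sketch of the flow (laptop_mode and the write_chunk bookkeeping elided):

for (;;) {
	get_dirty_limits(&background_thresh, &dirty_thresh,
			 &bdi_thresh, mapping);

	/* throttle against this device's share, not the global limit */
	if (bdi_stat(bdi, BDI_DIRTY) + bdi_stat(bdi, BDI_UNSTABLE) +
	    bdi_stat(bdi, BDI_WRITEBACK) <= bdi_thresh)
		break;

	writeback_inodes(&wbc);		/* push some of its dirty pages */

	if (pages_written >= write_chunk)
		break;			/* we've done our duty */

	congestion_wait(WRITE, HZ/10);
}
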
--


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC][PATCH 6/6] mm: expose BDI statistics in sysfs.
  2007-03-19 15:57 ` Peter Zijlstra
@ 2007-03-19 15:57   ` Peter Zijlstra
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-19 15:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra

[-- Attachment #1: bdi_stat_sysfs.patch --]
[-- Type: text/plain, Size: 2606 bytes --]

Expose the per BDI stats in /sys/block/<dev>/queue/*
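
With this applied each request queue would gain four read-only files, e.g.
(hypothetical device name):

  /sys/block/sda/queue/dirty_pages      pages dirtied against this device
  /sys/block/sda/queue/writeback_pages  pages under writeback
  /sys/block/sda/queue/unstable_pages   unstable (NFS) pages
  /sys/block/sda/queue/cache_ratio      the device's share of recent
                                        writeout, scaled to 0..1024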

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/ll_rw_blk.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -3923,6 +3923,33 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_dirty_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lu\n", bdi_stat(&q->backing_dev_info, BDI_DIRTY));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lu\n", bdi_stat(&q->backing_dev_info, BDI_WRITEBACK));
+}
+
+static ssize_t queue_nr_unstable_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lu\n", bdi_stat(&q->backing_dev_info, BDI_UNSTABLE));
+}
+
+extern void get_writeout_scale(struct backing_dev_info *, int *, int *);
+
+static ssize_t queue_nr_cache_show(struct request_queue *q, char *page)
+{
+	int scale, div;
+
+	get_writeout_scale(&q->backing_dev_info, &scale, &div);
+	scale *= 1024;
+	scale /= div;
+
+	return sprintf(page, "%d\n", scale);
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -3947,6 +3974,26 @@ static struct queue_sysfs_entry queue_ma
 	.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_dirty_entry = {
+	.attr = {.name = "dirty_pages", .mode = S_IRUGO },
+	.show = queue_nr_dirty_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+	.attr = {.name = "writeback_pages", .mode = S_IRUGO },
+	.show = queue_nr_writeback_show,
+};
+
+static struct queue_sysfs_entry queue_unstable_entry = {
+	.attr = {.name = "unstable_pages", .mode = S_IRUGO },
+	.show = queue_nr_unstable_show,
+};
+
+static struct queue_sysfs_entry queue_cache_entry = {
+	.attr = {.name = "cache_ratio", .mode = S_IRUGO },
+	.show = queue_nr_cache_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -3958,6 +4005,10 @@ static struct attribute *default_attrs[]
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
+	&queue_dirty_entry.attr,
+	&queue_writeback_entry.attr,
+	&queue_unstable_entry.attr,
+	&queue_cache_entry.attr,
 	&queue_iosched_entry.attr,
 	NULL,
 };

--


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/6] per device dirty throttling
  2007-03-19 15:57 ` Peter Zijlstra
@ 2007-03-19 18:29   ` Peter Zijlstra
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-19 18:29 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, neilb, dgc, tomoki.sekiyama.qu

Sorry for the duplicates; I was fooled by an MTA hanging on to them for a
few hours and had counted them as lost in cyberspace.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC][PATCH 7/6] assorted fixes
  2007-03-19 15:57 ` Peter Zijlstra
@ 2007-03-19 21:48   ` Peter Zijlstra
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-19 21:48 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, neilb, dgc, tomoki.sekiyama.qu


Just taking out the MSB isn't enough to counter the clipping on 0 done by
the stats counter accessors. Create some accessors that don't do that.

Also, increase the period to roughly the size of memory (TODO: determine a
sane upper bound here). This should give much more stable results. (Best
would be to keep it in the order of whatever vm_dirty_ratio gives; however,
changing vm_cycle_shift is dangerous).
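
(Worked example, made-up figures: on a 128M box with 4K pages,
vm_total_pages is 32768, so the old 3 + ilog2(int_sqrt(vm_total_pages))
gives a shift of 10, i.e. a period of 1024 writeouts; the new
ilog2(vm_total_pages) in the page_writeback_init() hunk below gives 15,
i.e. a period of 32768 writeouts -- on the order of memory size.)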

Finally, limit the per-device threshold so that it cannot grow faster than
the available dirty space. Without this the analytic model can use up to 2
times the dirty limit, and the discrete model is basically unbounded.
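
(Worked example with made-up numbers: say the dirty limit is 100 pages and
this device's writeout share comes out at 80%, so tmp = 80. If 70 pages
are already dirty/writeback/unstable system-wide, 10 of them against this
device, then reserve = (100 - 70) + 10 = 40 and the device's threshold
becomes min(80, 40) = 40 -- a device is never promised more than can
still be dirtied.)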

It goes *BANG* when using NFS,... need to look into that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |   12 ++++++++++++
 mm/page-writeback.c         |   37 ++++++++++++++++++++++---------------
 2 files changed, 34 insertions(+), 15 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -70,6 +70,12 @@ static inline void bdi_stat_add(long x, 
 	atomic_long_add(x, &bdi_stats[item]);
 }
 
+
+static inline unsigned long __global_bdi_stat(enum bdi_stat_item item)
+{
+	return atomic_long_read(&bdi_stats[item]);
+}
+
 /*
  * cannot be unsigned long and clip on 0.
  */
@@ -83,6 +89,12 @@ static inline unsigned long global_bdi_s
 	return x;
 }
 
+static inline unsigned long __bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return atomic_long_read(&bdi->bdi_stats[item]);
+}
+
 static inline unsigned long bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -119,16 +119,13 @@ static int vm_cycle_shift __read_mostly;
 
 /*
  * Sync up the per BDI average to the global cycle.
- *
- * NOTE: we mask out the MSB of the cycle count because bdi_stats really are
- * not unsigned long. (see comment in backing_dev.h)
  */
 static void bdi_writeout_norm(struct backing_dev_info *bdi)
 {
 	int bits = vm_cycle_shift;
 	unsigned long cycle = 1UL << bits;
-	unsigned long mask = ~(cycle - 1) | (1UL << BITS_PER_LONG-1);
-	unsigned long total = global_bdi_stat(BDI_WRITEOUT_TOTAL) << 1;
+	unsigned long mask = ~(cycle - 1);
+	unsigned long total = __global_bdi_stat(BDI_WRITEOUT_TOTAL) << 1;
 	unsigned long flags;
 
 	if ((bdi->cycles & mask) == (total & mask))
@@ -136,7 +133,7 @@ static void bdi_writeout_norm(struct bac
 
 	spin_lock_irqsave(&bdi->lock, flags);
 	while ((bdi->cycles & mask) != (total & mask)) {
-		unsigned long half = bdi_stat(bdi, BDI_WRITEOUT) / 2;
+		unsigned long half = __bdi_stat(bdi, BDI_WRITEOUT) / 2;
 
 		mod_bdi_stat(bdi, BDI_WRITEOUT, -half);
 		bdi->cycles += cycle;
@@ -158,13 +155,13 @@ static void bdi_writeout_inc(struct back
 void get_writeout_scale(struct backing_dev_info *bdi, int *scale, int *div)
 {
 	int bits = vm_cycle_shift - 1;
-	unsigned long total = global_bdi_stat(BDI_WRITEOUT_TOTAL);
+	unsigned long total = __global_bdi_stat(BDI_WRITEOUT_TOTAL);
 	unsigned long cycle = 1UL << bits;
 	unsigned long mask = cycle - 1;
 
 	if (bdi_cap_writeback_dirty(bdi)) {
 		bdi_writeout_norm(bdi);
-		*scale = bdi_stat(bdi, BDI_WRITEOUT);
+		*scale = __bdi_stat(bdi, BDI_WRITEOUT);
 	} else
 		*scale = 0;
 
@@ -234,19 +231,29 @@ get_dirty_limits(long *pbackground, long
 	*pdirty = dirty;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
+		long reserve;
 		long long tmp = dirty;
 		int scale, div;
 
-		get_writeout_scale(mapping->backing_dev_info, &scale, &div);
-
-		if (scale > div)
-			scale = div;
+		get_writeout_scale(bdi, &scale, &div);
 
-		tmp = (tmp * 122) >> 7; /* take ~95% of total dirty value */
 		tmp *= scale;
 		do_div(tmp, div);
 
-		*pbdi_dirty = (long)tmp;
+		reserve = dirty -
+			(global_bdi_stat(BDI_DIRTY) +
+			 global_bdi_stat(BDI_WRITEBACK) +
+			 global_bdi_stat(BDI_UNSTABLE));
+
+		if (reserve < 0)
+			reserve = 0;
+
+		reserve += bdi_stat(bdi, BDI_DIRTY) +
+			bdi_stat(bdi, BDI_WRITEBACK) +
+			bdi_stat(bdi, BDI_UNSTABLE);
+
+		*pbdi_dirty = min((long)tmp, reserve);
 	}
 }
 
@@ -627,7 +634,7 @@ void __init page_writeback_init(void)
 	mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
 	writeback_set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
-	vm_cycle_shift = 3 + ilog2(int_sqrt(vm_total_pages));
+	vm_cycle_shift = ilog2(vm_total_pages);
 }
 
 /**



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/6] per device dirty throttling
  2007-03-19 15:57 ` Peter Zijlstra
@ 2007-03-20  7:47   ` David Chinner
  -1 siblings, 0 replies; 28+ messages in thread
From: David Chinner @ 2007-03-20  7:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, akpm, neilb, dgc, tomoki.sekiyama.qu

On Mon, Mar 19, 2007 at 04:57:37PM +0100, Peter Zijlstra wrote:
> This patch-set implements per device dirty page throttling. Which should solve
> the problem we currently have with one device hogging the dirty limit.
> 
> Preliminary testing shows good results:

I just ran some higher throughput numbers on this patchset.

Identical 4-disk dm stripes, XFS, 4p x86_64, 16GB RAM, dirty_ratio = 5:

One dm stripe: 320MB/s
two dm stripes: 310+315MB/s
three dm stripes: 254+253+253MB/s (pci-x bus bound)

The three stripe test was for 100GB of data to each
filesystem - all the writes finished within 1s of each other
at 7m4s. Interestingly, the amount of memory in cache for
each of these devices was almost exactly the same - about
5.2GB each. Looks good so far....

Hmmm - small problem - root disk (XFS) got stuck in
balance_dirty_pages_ratelimited_nr() after the above write test
attempting to unmount the filesystems (i.e. umount trying
to modify /etc/mtab got stuck and the root fs locked up)

(reboot)

Non-identical dm stripes, XFS, run alone:

Single disk: 80MB/s
2 disk dm stripe: 155MB/s
4 disk dm stripe: 310MB/s

Combined, after some runtime:

# ls -sh /mnt/dm*/test
10G /mnt/dm0/test	19G /mnt/dm1/test	41G /mnt/dm2/test
15G /mnt/dm0/test	27G /mnt/dm1/test	52G /mnt/dm2/test
18G /mnt/dm0/test	32G /mnt/dm1/test	64G /mnt/dm2/test
24G /mnt/dm0/test	45G /mnt/dm1/test	86G /mnt/dm2/test
27G /mnt/dm0/test	51G /mnt/dm1/test	95G /mnt/dm2/test
29G /mnt/dm0/test	52G /mnt/dm1/test	97G /mnt/dm2/test
29G /mnt/dm0/test	54G /mnt/dm1/test	101G /mnt/dm2/test [done]
35G /mnt/dm0/test	65G /mnt/dm1/test	101G /mnt/dm2/test
38G /mnt/dm0/test	70G /mnt/dm1/test	101G /mnt/dm2/test

And so on. Final number:

Single disk: 70MB/s
2 disk dm stripe: 130MB/s
4 disk dm stripe: 260MB/s

So overall we've lost about 15-20% of the theoretical aggregate
performance, but we haven't starved any of the devices over a
long period of time.

However, looking at vmstat for total throughput, there are periods
of time where it appears that the fastest disk goes idle. That is,
we drop from an aggregate of about 550MB/s to below 300MB/s for
several seconds at a time. You can sort of see this from the file
size output above - long term the ratios remain the same, but in the
short term we see quite a bit of variability.

When the fast disk completed, I saw almost the same thing, but
this time it seems like the slow disk (i.e. ~230MB/s to ~150MB/s)
stopped for several seconds.

I haven't really digested what the patches do, but it's almost
like it is throttling a device completely while it allows another
to finish writing its quota (underestimating bandwidth?).

(umount after writes hung again. Same root disk thing as before....)

This is looking promising, Peter. When it is more stable I'll run
some more tests....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/6] per device dirty throttling
  2007-03-20  7:47   ` David Chinner
@ 2007-03-20  8:08     ` Peter Zijlstra
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-20  8:08 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-mm, linux-kernel, akpm, neilb, tomoki.sekiyama.qu

On Tue, 2007-03-20 at 18:47 +1100, David Chinner wrote:
> On Mon, Mar 19, 2007 at 04:57:37PM +0100, Peter Zijlstra wrote:
> > This patch-set implements per device dirty page throttling. Which should solve
> > the problem we currently have with one device hogging the dirty limit.
> > 
> > Preliminary testing shows good results:
> 
> I just ran some higher throughput numbers on this patchset.
> 
> Identical 4-disk dm stripes, XFS, 4p x86_64, 16GB RAM, dirty_ratio = 5:
> 
> One dm stripe: 320MB/s
> two dm stripes: 310+315MB/s
> three dm stripes: 254+253+253MB/s (pci-x bus bound)
> 
> The three stripe test was for 100GB of data to each
> filesystem - all the writes finished within 1s of each other
> at 7m4s. Interestingly, the amount of memory in cache for
> each of these devices was almost exactly the same - about
> 5.2GB each. Looks good so far....
> 
> Hmmm - small problem - root disk (XFS) got stuck in
> balance_dirty_pages_ratelimited_nr() after the above write test
> attempting to unmount the filesystems (i.e. umount trying
> to modify /etc/mtab got stuck and the root fs locked up)
> 
> (reboot)

Hmm, interesting, I'll look into it.

> Non-identical dm stripes, XFS, run alone:
> 
> Single disk: 80MB/s
> 2 disk dm stripe: 155MB/s
> 4 disk dm stripe: 310MB/s
> 
> Combined, after some runtime:
> 
> # ls -sh /mnt/dm*/test
> 10G /mnt/dm0/test	19G /mnt/dm1/test	41G /mnt/dm2/test
> 15G /mnt/dm0/test	27G /mnt/dm1/test	52G /mnt/dm2/test
> 18G /mnt/dm0/test	32G /mnt/dm1/test	64G /mnt/dm2/test
> 24G /mnt/dm0/test	45G /mnt/dm1/test	86G /mnt/dm2/test
> 27G /mnt/dm0/test	51G /mnt/dm1/test	95G /mnt/dm2/test
> 29G /mnt/dm0/test	52G /mnt/dm1/test	97G /mnt/dm2/test
> 29G /mnt/dm0/test	54G /mnt/dm1/test	101G /mnt/dm2/test [done]
> 35G /mnt/dm0/test	65G /mnt/dm1/test	101G /mnt/dm2/test
> 38G /mnt/dm0/test	70G /mnt/dm1/test	101G /mnt/dm2/test
> 
> And so on. Final number:
> 
> Single disk: 70MB/s
> 2 disk dm stripe: 130MB/s
> 4 disk dm stripe: 260MB/s
> 
> So overall we've lost about 15-20% of the theoretical aggregate
> performance, but we haven't starved any of the devices over a
> long period of time.
> 
> However, looking at vmstat for total throughput, there are periods
> of time where it appears that the fastest disk goes idle. That is,
> we drop from an aggregate of about 550MB/s to below 300MB/s for
> several seconds at a time. You can sort of see this from the file
> size output above - long term the ratios remain the same, but in the
> short term we see quite a bit of variability.

I suspect you did not apply 7/6? There is some trouble with signed vs
unsigned in the initial patch set that I tried to 'fix' by masking out
the MSB, but that doesn't work and results in 'time' getting stuck for
about half the time.
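
A minimal user-space sketch of that failure mode (the clamp below mimics
the clipping the stat accessors do; the values are made up):

#include <stdio.h>
#include <limits.h>

/* reads clamp negative values to 0 -- fine for page counts... */
static unsigned long clamped(long x)
{
	return x < 0 ? 0 : (unsigned long)x;
}

int main(void)
{
	/*
	 * ...but BDI_WRITEOUT_TOTAL is a free-running event counter
	 * that doubles as a clock: once it grows past LONG_MAX the
	 * signed read goes negative, the clamp returns 0, and the
	 * derived cycle count stops advancing -- for the second half
	 * of every wrap period.
	 */
	unsigned long t = (unsigned long)LONG_MAX + 42;

	printf("raw=%lu clamped=%lu\n", t, clamped((long)t));
	return 0;
}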

> When the fast disk completed, I saw almost the same thing, but
> this time it seems like the slow disk (i.e. ~230MB/s to ~150MB/s)
> stopped for several seconds.
> 
> I haven't really digested what the patches do,

If you have questions please ask, I'll try to write up coherent
answers :-)

>  but it's almost
> like it is throttling a device completely while it allows another
> to finish writing its quota (underestimating bandwidth?).

Yeah, there is some lumpiness in BIO submission or write completions it
seems, and when that granularity (multiplied by the number of active
devices) is larger than the 'time' period over which we average
(indicated by vm_cycle_shift) very weird stuff can happen.
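
To make that concrete, here is a user-space toy model (the numbers and the
eager decay are made up; the real code decays each BDI lazily against the
global cycle): device A completes writeout smoothly, device B in bursts on
the order of the period, and B's bursts make A's measured share -- and thus
the threshold derived from it -- swing hard between samples.

#include <stdio.h>

#define PERIOD 1024	/* stands in for 1UL << vm_cycle_shift */

static unsigned long total, share_a, share_b;

static void writeout(unsigned long *share, unsigned long pages)
{
	unsigned long i;

	for (i = 0; i < pages; i++) {
		(*share)++;
		if (++total % PERIOD == 0) {
			/* period boundary: decay everyone's count */
			share_a /= 2;
			share_b /= 2;
		}
	}
}

int main(void)
{
	int round;

	for (round = 0; round < 8; round++) {
		writeout(&share_a, 256);		/* smooth writer */
		if (round % 4 == 3)
			writeout(&share_b, 1024);	/* bursty writer */

		printf("round %d: a=%lu b=%lu a-share=%lu%%\n",
		       round, share_a, share_b,
		       100 * share_a / (share_a + share_b + 1));
	}
	return 0;
}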

> (umount after writes hung again. Same root disk thing as before....)
> 
> This is looking promising, Peter. When it is more stable I'll run
> some more tests....

Thanks for the tests.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/6] per device dirty throttling
  2007-03-20  8:08     ` Peter Zijlstra
@ 2007-03-20  9:38       ` David Chinner
  -1 siblings, 0 replies; 28+ messages in thread
From: David Chinner @ 2007-03-20  9:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Chinner, linux-mm, linux-kernel, akpm, neilb, tomoki.sekiyama.qu

On Tue, Mar 20, 2007 at 09:08:24AM +0100, Peter Zijlstra wrote:
> On Tue, 2007-03-20 at 18:47 +1100, David Chinner wrote:
> > So overall we've lost about 15-20% of the theoretical aggregate
> > performance, but we haven't starved any of the devices over a
> > long period of time.
> > 
> > However, looking at vmstat for total throughput, there are periods
> > of time where it appears that the fastest disk goes idle. That is,
> > we drop from an aggregate of about 550MB/s to below 300MB/s for
> > several seconds at a time. You can sort of see this from the file
> > size output above - long term the ratios remain the same, but in the
> > short term we see quite a bit of variability.
> 
> I suspect you did not apply 7/6? There is some trouble with signed vs
> unsigned in the initial patch set that I tried to 'fix' by masking out
> the MSB, but that doesn't work and results in 'time' getting stuck for
> about half the time.

I applied the fixes patch as well, so i had all that you posted...

> >  but it's almost
> > like it is throttling a device completely while it allows another
> > to finish writing its quota (underestimating bandwidth?).
> 
> Yeah, there is some lumpiness in BIO submission or write completions it
> seems, and when that granularity (multiplied by the number of active
> devices) is larger than the 'time' period over which we average
> (indicated by vm_cycle_shift) very weird stuff can happen.

Sounds like the period is a bit too short atm if we can get into this
sort of problem with only 2 active devices....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/6] per device dirty throttling
  2007-03-20  9:38       ` David Chinner
@ 2007-03-20  9:45         ` Peter Zijlstra
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-20  9:45 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-mm, linux-kernel, akpm, neilb, tomoki.sekiyama.qu

On Tue, 2007-03-20 at 20:38 +1100, David Chinner wrote:
> On Tue, Mar 20, 2007 at 09:08:24AM +0100, Peter Zijlstra wrote:
> > On Tue, 2007-03-20 at 18:47 +1100, David Chinner wrote:
> > > So overall we've lost about 15-20% of the theoretical aggregate
> > > performance, but we haven't starved any of the devices over a
> > > long period of time.
> > > 
> > > However, looking at vmstat for total throughput, there are periods
> > > of time where it appears that the fastest disk goes idle. That is,
> > > we drop from an aggregate of about 550MB/s to below 300MB/s for
> > > several seconds at a time. You can sort of see this from the file
> > > size output above - long term the ratios remain the same, but in the
> > > short term we see quite a bit of variability.
> > 
> > I suspect you did not apply 7/6? There is some trouble with signed vs
> > unsigned in the initial patch set that I tried to 'fix' by masking out
> > the MSB, but that doesn't work and results in 'time' getting stuck for
> > about half the time.
> 
> I applied the fixes patch as well, so i had all that you posted...

Humm, not that then.

> > >  but it's almost
> > > like it is throttling a device completely while it allows another
> > > to finish writing its quota (underestimating bandwidth?).
> > 
> > Yeah, there is some lumpiness in BIO submission or write completions it
> > seems, and when that granularity (multiplied by the number of active
> > devices) is larger than the 'time' period over which we average
> > (indicated by vm_cycle_shift) very weird stuff can happen.
> 
> Sounds like the period is a bit too short atm if we can get into this
> sort of problem with only 2 active devices....

Yeah, trouble is, I significantly extended this period in 7/6.
Will have to ponder a bit on what is happening then.

Anyway, thanks for the feedback.

I'll try and reproduce the umount problem, maybe that will give some
hints.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/6] per device dirty throttling
  2007-03-20  9:45         ` Peter Zijlstra
@ 2007-03-20 15:38           ` Peter Zijlstra
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-20 15:38 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-mm, linux-kernel, akpm, neilb, tomoki.sekiyama.qu


This seems to fix the worst of it; I'll run with it for a few days, then
respin and repost the patches if nothing weird happens.

---
Add missing bdi_stat_init() calls for NFS and FUSE.

Optimize bdi_writeout_norm(): break out of the halving loop once the
counter hits zero, since further halvings are no-ops. This allows 'new'
(or long-idle) BDIs to catch up without triggering NMI/softlockup messages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/fuse/inode.c     |    1 +
 fs/nfs/client.c     |    1 +
 mm/page-writeback.c |   25 +++++++++++++++++++------
 3 files changed, 21 insertions(+), 6 deletions(-)

Index: linux-2.6/fs/nfs/client.c
===================================================================
--- linux-2.6.orig/fs/nfs/client.c
+++ linux-2.6/fs/nfs/client.c
@@ -657,6 +657,7 @@ static void nfs_server_set_fsinfo(struct
 		server->rsize = NFS_MAX_FILE_IO_SIZE;
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+	bdi_stat_init(&server->backing_dev_info);
 
 	if (server->wsize > max_rpc_payload)
 		server->wsize = max_rpc_payload;
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -114,6 +114,13 @@ static void background_writeout(unsigned
  *
  * Furthermore, when we choose thresh to be 2^n it can be written in terms of
  * binary operations and wraparound artifacts disappear.
+ *
+ * Also note that this yields a natural counter of the elapsed periods:
+ *
+ *   i / thresh
+ *
+ * Its monotonically increasing property can be used to mitigate the wrap-
+ * around issue.
  */
 static int vm_cycle_shift __read_mostly;
 
@@ -125,19 +132,25 @@ static void bdi_writeout_norm(struct bac
 	int bits = vm_cycle_shift;
 	unsigned long cycle = 1UL << bits;
 	unsigned long mask = ~(cycle - 1);
-	unsigned long total = __global_bdi_stat(BDI_WRITEOUT_TOTAL) << 1;
+	unsigned long global_cycle =
+		(__global_bdi_stat(BDI_WRITEOUT_TOTAL) << 1) & mask;
 	unsigned long flags;
 
-	if ((bdi->cycles & mask) == (total & mask))
+	if ((bdi->cycles & mask) == global_cycle)
 		return;
 
 	spin_lock_irqsave(&bdi->lock, flags);
-	while ((bdi->cycles & mask) != (total & mask)) {
-		unsigned long half = __bdi_stat(bdi, BDI_WRITEOUT) / 2;
+	while ((bdi->cycles & mask) != global_cycle) {
+		unsigned long val = __bdi_stat(bdi, BDI_WRITEOUT);
+		unsigned long half = (val + 1) / 2;
+
+		if (!val)
+			break;
 
 		mod_bdi_stat(bdi, BDI_WRITEOUT, -half);
 		bdi->cycles += cycle;
 	}
+	bdi->cycles = global_cycle;
 	spin_unlock_irqrestore(&bdi->lock, flags);
 }
 
@@ -146,10 +159,10 @@ static void bdi_writeout_inc(struct back
 	if (!bdi_cap_writeback_dirty(bdi))
 		return;
 
+	bdi_writeout_norm(bdi);
+
 	__inc_bdi_stat(bdi, BDI_WRITEOUT);
 	__inc_bdi_stat(bdi, BDI_WRITEOUT_TOTAL);
-
-	bdi_writeout_norm(bdi);
 }
 
 void get_writeout_scale(struct backing_dev_info *bdi, int *scale, int *div)
Index: linux-2.6/fs/fuse/inode.c
===================================================================
--- linux-2.6.orig/fs/fuse/inode.c
+++ linux-2.6/fs/fuse/inode.c
@@ -413,6 +413,7 @@ static struct fuse_conn *new_conn(void)
 		atomic_set(&fc->num_waiting, 0);
 		fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 		fc->bdi.unplug_io_fn = default_unplug_io_fn;
+		bdi_stat_init(&fc->bdi);
 		fc->reqctr = 0;
 		fc->blocked = 1;
 		get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));



^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2007-03-20 15:38 UTC | newest]

Thread overview: 14 messages
2007-03-19 15:57 [RFC][PATCH 0/6] per device dirty throttling Peter Zijlstra
2007-03-19 15:57 ` [RFC][PATCH 1/6] mm: scalable bdi statistics counters Peter Zijlstra
2007-03-19 15:57 ` [RFC][PATCH 2/6] mm: count dirty pages per BDI Peter Zijlstra
2007-03-19 15:57 ` [RFC][PATCH 3/6] mm: count writeback pages per BDI Peter Zijlstra
2007-03-19 15:57 ` [RFC][PATCH 4/6] mm: count unstable pages per BDI Peter Zijlstra
2007-03-19 15:57 ` [RFC][PATCH 5/6] mm: per device dirty threshold Peter Zijlstra
2007-03-19 15:57 ` [RFC][PATCH 6/6] mm: expose BDI statistics in sysfs Peter Zijlstra
2007-03-19 18:29 ` [RFC][PATCH 0/6] per device dirty throttling Peter Zijlstra
2007-03-19 21:48 ` [RFC][PATCH 7/6] assorted fixes Peter Zijlstra
2007-03-20  7:47 ` [RFC][PATCH 0/6] per device dirty throttling David Chinner
2007-03-20  8:08   ` Peter Zijlstra
2007-03-20  9:38     ` David Chinner
2007-03-20  9:45       ` Peter Zijlstra
2007-03-20 15:38         ` Peter Zijlstra
