linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
@ 2012-09-16 12:25 Namjae Jeon
  2012-09-20  8:44 ` Fengguang Wu
  0 siblings, 1 reply; 16+ messages in thread
From: Namjae Jeon @ 2012-09-16 12:25 UTC (permalink / raw)
  To: fengguang.wu; +Cc: linux-kernel, Namjae Jeon, Namjae Jeon, Vivek Trivedi

From: Namjae Jeon <namjae.jeon@samsung.com>

This patch is based on a suggestion by Wu Fengguang:
https://lkml.org/lkml/2011/8/19/19

The kernel has a mechanism to perform writeback according to dirty_ratio and
dirty_background_ratio. It also maintains a per-task dirty rate limit to keep
the number of dirty pages balanced at any given instant by doing bdi
bandwidth estimation.

The kernel also has max_ratio/min_ratio tunables to specify the percentage of
write cache used to control per-bdi dirty limits and task throttling.

However, there may be use cases where the user wants a per-bdi writeback
tuning parameter to flush dirty data once the per-bdi dirty data reaches a
threshold, especially on an NFS server.

dirty_background_centisecs provides an interface through which the user can
tune the background writeback start threshold using
/sys/block/sda/bdi/dirty_background_centisecs

dirty_background_centisecs is used along with the average bdi write bandwidth
estimation to start background writeback.
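
For illustration, setting the knob is an ordinary sysfs write, equivalent to
"echo 100 > /sys/block/sda/bdi/dirty_background_centisecs". A minimal
userspace C sketch, with the device path and value as examples only:

/*
 * Illustration only: set dirty_background_centisecs to 100, i.e. start
 * background writeback once roughly one second's worth of the estimated
 * write bandwidth is dirty.  Device path and value are examples.
 */
#include <stdio.h>

int main(void)
{
	const char *knob = "/sys/block/sda/bdi/dirty_background_centisecs";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror(knob);
		return 1;
	}
	fprintf(f, "%u\n", 100u);
	return fclose(f) ? 1 : 0;
}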

One use case that demonstrates the patch's functionality is an NFS setup:
we have an NFS setup with a 100 Mbps Ethernet link, and a USB disk attached
to the server with a local write speed of 25 MB/s. Both server and client
are ARM target boards.

Now if we perform a write operation over NFS (client to server), the data
can travel at a maximum of 100 Mbps as dictated by the network speed. But if
we check the default write speed of the USB HDD over NFS, it comes out to
around 8 MB/s, far below the speed of the network.

The reason is that, per the NFS logic, during a write operation pages are
initially dirtied on the NFS client side; then, after reaching the dirty
threshold/writeback limit (or in case of sync), the data is actually sent to
the NFS server (so pages are dirtied again on the server side). This is done
in the COMMIT call from client to server, i.e. if 100 MB of data is dirtied
and sent, sending it over the 100 Mbps link takes a minimum of ~8-9 seconds.

After the data is received, it takes approximately 100 MB / 25 MB/s ~ 4
seconds to write the data to the USB HDD on the server side. This makes the
overall time to write this much data ~12 seconds, which in practice comes
out to be nearly 7 to 8 MB/s. Only after this is a COMMIT response sent to
the NFS client.
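
Putting numbers on this, using the figures above: network transfer of
100 MB at 12.5 MB/s (100 Mbps) ~ 8 s; local disk write of 100 MB at
25 MB/s ~ 4 s. Done serially that is ~12 s, i.e. ~8.3 MB/s overall, whereas
fully overlapping the two stages is bounded by the slower one, ~8 s,
i.e. ~12.5 MB/s.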

However, we can improve this write performance by making use of the NFS
server's idle time, i.e. while data is being received from the client,
simultaneously initiate writeback on the server side. So instead of waiting
for the complete data to arrive and only then starting writeback, we can
work in parallel while the network is still busy receiving data, improving
overall performance.

If we tune dirty_background_centisecs, we see an increase in performance to
~11 MB/s, approaching the overlap bound noted above. The results are:

Write test (create a 1 GB file) results at the 'NFS client' after changing
/sys/block/sda/bdi/dirty_background_centisecs
on the *** NFS Server only - not on the NFS Client ***

---------------------------------------------------------------------
|WRITE Test with various 'dirty_background_centisecs' at NFS Server |
---------------------------------------------------------------------
|          | default = 0 | 300 centisec| 200 centisec| 100 centisec |
---------------------------------------------------------------------
|RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
---------------------------------------------------------------------
|10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
| 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
|  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
|  262144  |  8.16MB/sec |  8.51MB/sec |  9.52MB/sec |  10.62MB/sec |
|  131072  |  8.48MB/sec |  8.81MB/sec |  9.42MB/sec |  10.55MB/sec |
|   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
|   32768  |  8.65MB/sec |  9.00MB/sec |  9.57MB/sec |  10.54MB/sec |
|   16384  |  8.27MB/sec |  8.80MB/sec |  9.39MB/sec |  10.43MB/sec |
|    8192  |  8.52MB/sec |  8.70MB/sec |  9.40MB/sec |  10.50MB/sec |
|    4096  |  8.20MB/sec |  8.63MB/sec |  9.80MB/sec |  10.35MB/sec |
---------------------------------------------------------------------

As we can see, the average write speed increases to ~10-11 MB/sec.
============================================================

This patch makes the change per block device, so that
dirty_background_centisecs can be tuned per device; the overall system is
not impacted by the change, and we get improved performance in certain use
cases.

NOTE: dirty_background_centisecs is used along with the average bdi write
bandwidth estimation to start background writeback. But the bdi write
bandwidth estimation is only an _estimation_ and may become wildly wrong, so
dirty_background_centisecs tuning may not always match user expectations and
will require careful tuning by users on the NFS server.
As a rule of thumb, dirty_background_centisecs should be set to around
100 (1 sec). It should not be set to a very small value, otherwise writeback
will start flushing for very small amounts of dirty data.
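
To make the guidance above concrete, here is a small userspace sketch of the
threshold the patch computes (background_thresh = avg_write_bandwidth *
dirty_background_centisecs / 100). The kernel works in pages and uses its
own bandwidth estimate; the 25 MB/s figure below is only the example device
from this description:

#include <stdio.h>

int main(void)
{
	/* Assumed bandwidth estimate, matching the ~25 MB/s USB disk above. */
	unsigned long long avg_write_bandwidth = 25ULL << 20;	/* bytes/sec */
	unsigned int settings[] = { 10, 100, 300 };
	unsigned int i;

	for (i = 0; i < 3; i++) {
		unsigned long long thresh =
			avg_write_bandwidth * settings[i] / 100;
		printf("dirty_background_centisecs=%3u -> flush at ~%llu MB dirty\n",
		       settings[i], thresh >> 20);
	}
	return 0;
}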

Changes since v1:
* Make the default value 'dirty_background_centisecs = 0' so that there is
  no change in the default writeback behaviour.
* Add a description of dirty_background_centisecs to the documentation.

Original-patch-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
---
 fs/fs-writeback.c           |   21 +++++++++++++++++++--
 include/linux/backing-dev.h |    1 +
 include/linux/writeback.h   |    1 +
 mm/backing-dev.c            |   21 +++++++++++++++++++++
 mm/page-writeback.c         |    3 ++-
 5 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index fd255c0..c427130 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -769,6 +769,22 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
 	return false;
 }
 
+bool over_dirty_bground_time(struct backing_dev_info *bdi)
+{
+	unsigned long background_thresh;
+
+	if (!bdi->dirty_background_centisecs)
+		return false;
+
+	background_thresh = bdi->avg_write_bandwidth *
+				bdi->dirty_background_centisecs / 100;
+
+	if (bdi_stat(bdi, BDI_RECLAIMABLE) > background_thresh)
+		return true;
+
+	return false;
+}
+
 /*
  * Called under wb->list_lock. If there are multiple wb per bdi,
  * only the flusher working on the first wb should do it.
@@ -828,7 +844,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * For background writeout, stop when we are below the
 		 * background dirty threshold
 		 */
-		if (work->for_background && !over_bground_thresh(wb->bdi))
+		if (work->for_background && !over_bground_thresh(wb->bdi) &&
+			!over_dirty_bground_time(wb->bdi))
 			break;
 
 		/*
@@ -920,7 +937,7 @@ static unsigned long get_nr_dirty_pages(void)
 
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
-	if (over_bground_thresh(wb->bdi)) {
+	if (over_bground_thresh(wb->bdi) || over_dirty_bground_time(wb->bdi)) {
 
 		struct wb_writeback_work work = {
 			.nr_pages	= LONG_MAX,
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2a9a9ab..43d2e15 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -95,6 +95,7 @@ struct backing_dev_info {
 
 	unsigned int min_ratio;
 	unsigned int max_ratio, max_prop_frac;
+	unsigned int dirty_background_centisecs;
 
 	struct bdi_writeback wb;  /* default writeback info for this bdi */
 	spinlock_t wb_lock;	  /* protects work_list */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 50c3e8f..6dc2abe 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -96,6 +96,7 @@ long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
 void inode_wait_for_writeback(struct inode *inode);
+bool over_dirty_bground_time(struct backing_dev_info *bdi);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d3ca2b3..b1b2fd2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -221,12 +221,32 @@ static ssize_t max_ratio_store(struct device *dev,
 }
 BDI_SHOW(max_ratio, bdi->max_ratio)
 
+static ssize_t dirty_background_centisecs_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	unsigned int centisecs;
+	ssize_t ret;
+
+	ret = kstrtouint(buf, 10, &centisecs);
+	if (ret < 0)
+		return ret;
+
+	bdi->dirty_background_centisecs = centisecs;
+	if (over_dirty_bground_time(bdi))
+		bdi_start_background_writeback(bdi);
+
+	return count;
+}
+BDI_SHOW(dirty_background_centisecs, bdi->dirty_background_centisecs)
+
 #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
 
 static struct device_attribute bdi_dev_attrs[] = {
 	__ATTR_RW(read_ahead_kb),
 	__ATTR_RW(min_ratio),
 	__ATTR_RW(max_ratio),
+	__ATTR_RW(dirty_background_centisecs),
 	__ATTR_NULL,
 };
 
@@ -628,6 +648,7 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
+	bdi->dirty_background_centisecs = 0;
 	spin_lock_init(&bdi->wb_lock);
 	INIT_LIST_HEAD(&bdi->bdi_list);
 	INIT_LIST_HEAD(&bdi->work_list);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5ad5ce2..8c1530d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1403,7 +1403,8 @@ pause:
 	if (laptop_mode)
 		return;
 
-	if (nr_reclaimable > background_thresh)
+	if (nr_reclaimable > background_thresh ||
+			over_dirty_bground_time(bdi))
 		bdi_start_background_writeback(bdi);
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-09-16 12:25 [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable Namjae Jeon
@ 2012-09-20  8:44 ` Fengguang Wu
  2012-09-24 22:23   ` Jan Kara
  2012-09-25  1:36   ` Dave Chinner
  0 siblings, 2 replies; 16+ messages in thread
From: Fengguang Wu @ 2012-09-20  8:44 UTC (permalink / raw)
  To: Namjae Jeon
  Cc: Jan Kara, Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi,
	Linux Memory Management List, linux-fsdevel

[ CC FS and MM lists ]

Patch looks good to me; however, we need to be careful because it's
introducing a new interface. So it's desirable to get some acks from
the FS/MM developers.

Thanks,
Fengguang

On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> From: Namjae Jeon <namjae.jeon@samsung.com>
> 
> This patch is based on suggestion by Wu Fengguang:
> https://lkml.org/lkml/2011/8/19/19
> 
> kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> ratio. It also maintains per task dirty rate limit to keep balance of
> dirty pages at any given instance by doing bdi bandwidth estimation.
> 
> Kernel also has max_ratio/min_ratio tunables to specify percentage of
> writecache to control per bdi dirty limits and task throttling.
> 
> However, there might be a usecase where user wants a per bdi writeback tuning
> parameter to flush dirty data once per bdi dirty data reach a threshold
> especially at NFS server.
> 
> dirty_background_centisecs provides an interface where user can tune
> background writeback start threshold using
> /sys/block/sda/bdi/dirty_background_centisecs
> 
> dirty_background_centisecs is used alongwith average bdi write bandwidth
> estimation to start background writeback.
> 
> One of the use case to demonstrate the patch functionality can be
> on NFS setup:-
> We have a NFS setup with ethernet line of 100Mbps, while the USB
> disk is attached to server, which has a local speed of 25MBps. Server
> and client both are arm target boards.
> 
> Now if we perform a write operation over NFS (client to server), as
> per the network speed, data can travel at max speed of 100Mbps. But
> if we check the default write speed of USB hdd over NFS it comes
> around to 8MB/sec, far below the speed of network.
> 
> Reason being is as per the NFS logic, during write operation, initially
> pages are dirtied on NFS client side, then after reaching the dirty
> threshold/writeback limit (or in case of sync) data is actually sent
> to NFS server (so now again pages are dirtied on server side). This
> will be done in COMMIT call from client to server i.e if 100MB of data
> is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9 seconds.
> 
> After the data is received, now it will take approx 100/25 ~4 Seconds to
> write the data to USB Hdd on server side. Hence making the overall time
> to write this much of data ~12 seconds, which in practically comes out to
> be near 7 to 8MB/second. After this a COMMIT response will be sent to NFS
> client.
> 
> However we may improve this write performace by making the use of NFS
> server idle time i.e while data is being received from the client,
> simultaneously initiate the writeback thread on server side. So instead
> of waiting for the complete data to come and then start the writeback,
> we can work in parallel while the network is still busy in receiving the
> data. Hence in this way overall performace will be improved.
> 
> If we tune dirty_background_centisecs, we can see there
> is increase in the performace and it comes out to be ~ 11MB/seconds.
> Results are:-
> 
> Write test(create a 1 GB file) result at 'NFS client' after changing 
> /sys/block/sda/bdi/dirty_background_centisecs 
> on  *** NFS Server only - not on NFS Client ****
> 
> ---------------------------------------------------------------------
> |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
> ---------------------------------------------------------------------
> |          | default = 0 | 300 centisec| 200 centisec| 100 centisec |
> ---------------------------------------------------------------------
> |RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
> ---------------------------------------------------------------------
> |10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
> | 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
> |  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
> |  262144  |  8.16MB/sec |  8.51MB/sec |  9.52MB/sec |  10.62MB/sec |
> |  131072  |  8.48MB/sec |  8.81MB/sec |  9.42MB/sec |  10.55MB/sec |
> |   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
> |   32768  |  8.65MB/sec |  9.00MB/sec |  9.57MB/sec |  10.54MB/sec |
> |   16384  |  8.27MB/sec |  8.80MB/sec |  9.39MB/sec |  10.43MB/sec |
> |    8192  |  8.52MB/sec |  8.70MB/sec |  9.40MB/sec |  10.50MB/sec |
> |    4096  |  8.20MB/sec |  8.63MB/sec |  9.80MB/sec |  10.35MB/sec |
> ---------------------------------------------------------------------
> 
> we can see, average write speed is increased to ~10-11MB/sec.
> ============================================================
> 
> This patch provides the changes per block devices. So that we may modify the
> dirty_background_centisecs as per the device and overall system is not impacted
> by the changes and we get improved perforamace in certain use cases.
> 
> NOTE: dirty_background_centisecs is used alongwith average bdi write bandwidth
> estimation to start background writeback. But, bdi write bandwidth estimation
> is an _estimation_ and may become wildly wrong. dirty_background_centisecs
> tuning may not always work to the user expectations. dirty_background_centisecs
> will require careful tuning by users on NFS Server.
> As a good use case, dirty_background_time should be set around 100 (1 sec).
> It should not be set to very small value, otherwise it will start
> flushing for small I/O size dirty data.
> 
> Changes since v1:
> * make default value of 'dirty_background_centisecs = 0' sothat there is no change
>   in default writeback behaviour.
> * Add description of dirty_background_centisecs in documentation.
> 
> Original-patch-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
> Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
> ---
>  fs/fs-writeback.c           |   21 +++++++++++++++++++--
>  include/linux/backing-dev.h |    1 +
>  include/linux/writeback.h   |    1 +
>  mm/backing-dev.c            |   21 +++++++++++++++++++++
>  mm/page-writeback.c         |    3 ++-
>  5 files changed, 44 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index fd255c0..c427130 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -769,6 +769,22 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
>  	return false;
>  }
>  
> +bool over_dirty_bground_time(struct backing_dev_info *bdi)
> +{
> +	unsigned long background_thresh;
> +
> +	if (!bdi->dirty_background_centisecs)
> +		return false;
> +
> +	background_thresh = bdi->avg_write_bandwidth *
> +				bdi->dirty_background_centisecs / 100;
> +
> +	if (bdi_stat(bdi, BDI_RECLAIMABLE) > background_thresh)
> +		return true;
> +
> +	return false;
> +}
> +
>  /*
>   * Called under wb->list_lock. If there are multiple wb per bdi,
>   * only the flusher working on the first wb should do it.
> @@ -828,7 +844,8 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		 * For background writeout, stop when we are below the
>  		 * background dirty threshold
>  		 */
> -		if (work->for_background && !over_bground_thresh(wb->bdi))
> +		if (work->for_background && !over_bground_thresh(wb->bdi) &&
> +			!over_dirty_bground_time(wb->bdi))
>  			break;
>  
>  		/*
> @@ -920,7 +937,7 @@ static unsigned long get_nr_dirty_pages(void)
>  
>  static long wb_check_background_flush(struct bdi_writeback *wb)
>  {
> -	if (over_bground_thresh(wb->bdi)) {
> +	if (over_bground_thresh(wb->bdi) || over_dirty_bground_time(wb->bdi)) {
>  
>  		struct wb_writeback_work work = {
>  			.nr_pages	= LONG_MAX,
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 2a9a9ab..43d2e15 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -95,6 +95,7 @@ struct backing_dev_info {
>  
>  	unsigned int min_ratio;
>  	unsigned int max_ratio, max_prop_frac;
> +	unsigned int dirty_background_centisecs;
>  
>  	struct bdi_writeback wb;  /* default writeback info for this bdi */
>  	spinlock_t wb_lock;	  /* protects work_list */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 50c3e8f..6dc2abe 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -96,6 +96,7 @@ long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
>  long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
>  void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
>  void inode_wait_for_writeback(struct inode *inode);
> +bool over_dirty_bground_time(struct backing_dev_info *bdi);
>  
>  /* writeback.h requires fs.h; it, too, is not included from here. */
>  static inline void wait_on_inode(struct inode *inode)
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index d3ca2b3..b1b2fd2 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -221,12 +221,32 @@ static ssize_t max_ratio_store(struct device *dev,
>  }
>  BDI_SHOW(max_ratio, bdi->max_ratio)
>  
> +static ssize_t dirty_background_centisecs_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t count)
> +{
> +	struct backing_dev_info *bdi = dev_get_drvdata(dev);
> +	unsigned int centisecs;
> +	ssize_t ret;
> +
> +	ret = kstrtouint(buf, 10, &centisecs);
> +	if (ret < 0)
> +		return ret;
> +
> +	bdi->dirty_background_centisecs = centisecs;
> +	if (over_dirty_bground_time(bdi))
> +		bdi_start_background_writeback(bdi);
> +
> +	return count;
> +}
> +BDI_SHOW(dirty_background_centisecs, bdi->dirty_background_centisecs)
> +
>  #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
>  
>  static struct device_attribute bdi_dev_attrs[] = {
>  	__ATTR_RW(read_ahead_kb),
>  	__ATTR_RW(min_ratio),
>  	__ATTR_RW(max_ratio),
> +	__ATTR_RW(dirty_background_centisecs),
>  	__ATTR_NULL,
>  };
>  
> @@ -628,6 +648,7 @@ int bdi_init(struct backing_dev_info *bdi)
>  	bdi->min_ratio = 0;
>  	bdi->max_ratio = 100;
>  	bdi->max_prop_frac = FPROP_FRAC_BASE;
> +	bdi->dirty_background_centisecs = 0;
>  	spin_lock_init(&bdi->wb_lock);
>  	INIT_LIST_HEAD(&bdi->bdi_list);
>  	INIT_LIST_HEAD(&bdi->work_list);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5ad5ce2..8c1530d 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1403,7 +1403,8 @@ pause:
>  	if (laptop_mode)
>  		return;
>  
> -	if (nr_reclaimable > background_thresh)
> +	if (nr_reclaimable > background_thresh ||
> +			over_dirty_bground_time(bdi))
>  		bdi_start_background_writeback(bdi);
>  }
>  
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-09-20  8:44 ` Fengguang Wu
@ 2012-09-24 22:23   ` Jan Kara
  2012-09-25  6:50     ` Namjae Jeon
  2012-09-26 16:56     ` Fengguang Wu
  2012-09-25  1:36   ` Dave Chinner
  1 sibling, 2 replies; 16+ messages in thread
From: Jan Kara @ 2012-09-24 22:23 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Namjae Jeon, Jan Kara, Dave Chinner, linux-kernel, Namjae Jeon,
	Vivek Trivedi, Linux Memory Management List, linux-fsdevel

On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> > From: Namjae Jeon <namjae.jeon@samsung.com>
> > 
> > This patch is based on suggestion by Wu Fengguang:
> > https://lkml.org/lkml/2011/8/19/19
> > 
> > kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> > ratio. It also maintains per task dirty rate limit to keep balance of
> > dirty pages at any given instance by doing bdi bandwidth estimation.
> > 
> > Kernel also has max_ratio/min_ratio tunables to specify percentage of
> > writecache to control per bdi dirty limits and task throttling.
> > 
> > However, there might be a usecase where user wants a per bdi writeback tuning
> > parameter to flush dirty data once per bdi dirty data reach a threshold
> > especially at NFS server.
> > 
> > dirty_background_centisecs provides an interface where user can tune
> > background writeback start threshold using
> > /sys/block/sda/bdi/dirty_background_centisecs
> > 
> > dirty_background_centisecs is used alongwith average bdi write bandwidth
> > estimation to start background writeback.
  The functionality you describe, i.e. start flushing a bdi when there's a
reasonable amount of dirty data on it, looks sensible and useful. However,
I'm not so sure whether the interface you propose is the right one.
Traditionally, we allow the user to set the amount of dirty data (either in
bytes or as a percentage of memory) at which background writeback should
start. You propose setting the amount of data in centisecs-to-write. Why
that difference? Also, this interface ties our throughput estimation code
(which is an implementation detail of the current dirty throttling) to the
userspace API. So we'd have to maintain the estimation code forever, and
possibly face problems when we change the estimation code (and thus the
estimates in some cases) and users complain that the values they set
originally no longer work as they used to.

Also, as with each knob, there's the problem of how to properly set its
value. Most admins won't know about the knob and so won't touch it. Others
might know about the knob but will have a hard time figuring out what value
they should set. So if there's a new knob, it should have a sensible
initial value. And since this feature looks like a useful one, it shouldn't
be zero.

So my personal preference would be to have bdi->dirty_background_ratio and
bdi->dirty_background_bytes, and start background writeback whenever either
the global background limit or the per-bdi background limit is exceeded. I
think this interface will do the job as well, and it's easier to maintain
in the future.
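
[ For illustration only: a minimal sketch of the per-bdi background check
  being suggested here.  The dirty_background_bytes field on struct
  backing_dev_info is hypothetical - this is not existing kernel code - and
  the ratio variant would additionally need the amount of dirtyable memory,
  which is left out for brevity. ]

static bool over_bdi_bground_limit(struct backing_dev_info *bdi)
{
	unsigned long bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE); /* pages */
	unsigned long bdi_thresh;

	if (!bdi->dirty_background_bytes)
		return false;

	bdi_thresh = DIV_ROUND_UP(bdi->dirty_background_bytes, PAGE_SIZE);
	return bdi_dirty > bdi_thresh;
}

/* wb_check_background_flush() would then test something like:
 *	over_bground_thresh(wb->bdi) || over_bdi_bground_limit(wb->bdi)
 */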

								Honza

> > One of the use case to demonstrate the patch functionality can be
> > on NFS setup:-
> > We have a NFS setup with ethernet line of 100Mbps, while the USB
> > disk is attached to server, which has a local speed of 25MBps. Server
> > and client both are arm target boards.
> > 
> > Now if we perform a write operation over NFS (client to server), as
> > per the network speed, data can travel at max speed of 100Mbps. But
> > if we check the default write speed of USB hdd over NFS it comes
> > around to 8MB/sec, far below the speed of network.
> > 
> > Reason being is as per the NFS logic, during write operation, initially
> > pages are dirtied on NFS client side, then after reaching the dirty
> > threshold/writeback limit (or in case of sync) data is actually sent
> > to NFS server (so now again pages are dirtied on server side). This
> > will be done in COMMIT call from client to server i.e if 100MB of data
> > is dirtied and sent then it will take minimum 100MB/100Mbps ~ 8-9 seconds.
> > 
> > After the data is received, now it will take approx 100/25 ~4 Seconds to
> > write the data to USB Hdd on server side. Hence making the overall time
> > to write this much of data ~12 seconds, which in practically comes out to
> > be near 7 to 8MB/second. After this a COMMIT response will be sent to NFS
> > client.
> > 
> > However we may improve this write performace by making the use of NFS
> > server idle time i.e while data is being received from the client,
> > simultaneously initiate the writeback thread on server side. So instead
> > of waiting for the complete data to come and then start the writeback,
> > we can work in parallel while the network is still busy in receiving the
> > data. Hence in this way overall performace will be improved.
> > 
> > If we tune dirty_background_centisecs, we can see there
> > is increase in the performace and it comes out to be ~ 11MB/seconds.
> > Results are:-
> > 
> > Write test(create a 1 GB file) result at 'NFS client' after changing 
> > /sys/block/sda/bdi/dirty_background_centisecs 
> > on  *** NFS Server only - not on NFS Client ****
> > 
> > ---------------------------------------------------------------------
> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
> > ---------------------------------------------------------------------
> > |          | default = 0 | 300 centisec| 200 centisec| 100 centisec |
> > ---------------------------------------------------------------------
> > |RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
> > ---------------------------------------------------------------------
> > |10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
> > | 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
> > |  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
> > |  262144  |  8.16MB/sec |  8.51MB/sec |  9.52MB/sec |  10.62MB/sec |
> > |  131072  |  8.48MB/sec |  8.81MB/sec |  9.42MB/sec |  10.55MB/sec |
> > |   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
> > |   32768  |  8.65MB/sec |  9.00MB/sec |  9.57MB/sec |  10.54MB/sec |
> > |   16384  |  8.27MB/sec |  8.80MB/sec |  9.39MB/sec |  10.43MB/sec |
> > |    8192  |  8.52MB/sec |  8.70MB/sec |  9.40MB/sec |  10.50MB/sec |
> > |    4096  |  8.20MB/sec |  8.63MB/sec |  9.80MB/sec |  10.35MB/sec |
> > ---------------------------------------------------------------------
> > 
> > we can see, average write speed is increased to ~10-11MB/sec.
> > ============================================================
> > 
> > This patch provides the changes per block devices. So that we may modify the
> > dirty_background_centisecs as per the device and overall system is not impacted
> > by the changes and we get improved perforamace in certain use cases.
> > 
> > NOTE: dirty_background_centisecs is used alongwith average bdi write bandwidth
> > estimation to start background writeback. But, bdi write bandwidth estimation
> > is an _estimation_ and may become wildly wrong. dirty_background_centisecs
> > tuning may not always work to the user expectations. dirty_background_centisecs
> > will require careful tuning by users on NFS Server.
> > As a good use case, dirty_background_time should be set around 100 (1 sec).
> > It should not be set to very small value, otherwise it will start
> > flushing for small I/O size dirty data.
> > 
> > Changes since v1:
> > * make default value of 'dirty_background_centisecs = 0' sothat there is no change
> >   in default writeback behaviour.
> > * Add description of dirty_background_centisecs in documentation.
> > 
> > Original-patch-by: Wu Fengguang <fengguang.wu@intel.com>
> > Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
> > Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
> > ---
> >  fs/fs-writeback.c           |   21 +++++++++++++++++++--
> >  include/linux/backing-dev.h |    1 +
> >  include/linux/writeback.h   |    1 +
> >  mm/backing-dev.c            |   21 +++++++++++++++++++++
> >  mm/page-writeback.c         |    3 ++-
> >  5 files changed, 44 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index fd255c0..c427130 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -769,6 +769,22 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
> >  	return false;
> >  }
> >  
> > +bool over_dirty_bground_time(struct backing_dev_info *bdi)
> > +{
> > +	unsigned long background_thresh;
> > +
> > +	if (!bdi->dirty_background_centisecs)
> > +		return false;
> > +
> > +	background_thresh = bdi->avg_write_bandwidth *
> > +				bdi->dirty_background_centisecs / 100;
> > +
> > +	if (bdi_stat(bdi, BDI_RECLAIMABLE) > background_thresh)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> >  /*
> >   * Called under wb->list_lock. If there are multiple wb per bdi,
> >   * only the flusher working on the first wb should do it.
> > @@ -828,7 +844,8 @@ static long wb_writeback(struct bdi_writeback *wb,
> >  		 * For background writeout, stop when we are below the
> >  		 * background dirty threshold
> >  		 */
> > -		if (work->for_background && !over_bground_thresh(wb->bdi))
> > +		if (work->for_background && !over_bground_thresh(wb->bdi) &&
> > +			!over_dirty_bground_time(wb->bdi))
> >  			break;
> >  
> >  		/*
> > @@ -920,7 +937,7 @@ static unsigned long get_nr_dirty_pages(void)
> >  
> >  static long wb_check_background_flush(struct bdi_writeback *wb)
> >  {
> > -	if (over_bground_thresh(wb->bdi)) {
> > +	if (over_bground_thresh(wb->bdi) || over_dirty_bground_time(wb->bdi)) {
> >  
> >  		struct wb_writeback_work work = {
> >  			.nr_pages	= LONG_MAX,
> > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > index 2a9a9ab..43d2e15 100644
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > @@ -95,6 +95,7 @@ struct backing_dev_info {
> >  
> >  	unsigned int min_ratio;
> >  	unsigned int max_ratio, max_prop_frac;
> > +	unsigned int dirty_background_centisecs;
> >  
> >  	struct bdi_writeback wb;  /* default writeback info for this bdi */
> >  	spinlock_t wb_lock;	  /* protects work_list */
> > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > index 50c3e8f..6dc2abe 100644
> > --- a/include/linux/writeback.h
> > +++ b/include/linux/writeback.h
> > @@ -96,6 +96,7 @@ long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
> >  long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
> >  void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
> >  void inode_wait_for_writeback(struct inode *inode);
> > +bool over_dirty_bground_time(struct backing_dev_info *bdi);
> >  
> >  /* writeback.h requires fs.h; it, too, is not included from here. */
> >  static inline void wait_on_inode(struct inode *inode)
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index d3ca2b3..b1b2fd2 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -221,12 +221,32 @@ static ssize_t max_ratio_store(struct device *dev,
> >  }
> >  BDI_SHOW(max_ratio, bdi->max_ratio)
> >  
> > +static ssize_t dirty_background_centisecs_store(struct device *dev,
> > +		struct device_attribute *attr, const char *buf, size_t count)
> > +{
> > +	struct backing_dev_info *bdi = dev_get_drvdata(dev);
> > +	unsigned int centisecs;
> > +	ssize_t ret;
> > +
> > +	ret = kstrtouint(buf, 10, &centisecs);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	bdi->dirty_background_centisecs = centisecs;
> > +	if (over_dirty_bground_time(bdi))
> > +		bdi_start_background_writeback(bdi);
> > +
> > +	return count;
> > +}
> > +BDI_SHOW(dirty_background_centisecs, bdi->dirty_background_centisecs)
> > +
> >  #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
> >  
> >  static struct device_attribute bdi_dev_attrs[] = {
> >  	__ATTR_RW(read_ahead_kb),
> >  	__ATTR_RW(min_ratio),
> >  	__ATTR_RW(max_ratio),
> > +	__ATTR_RW(dirty_background_centisecs),
> >  	__ATTR_NULL,
> >  };
> >  
> > @@ -628,6 +648,7 @@ int bdi_init(struct backing_dev_info *bdi)
> >  	bdi->min_ratio = 0;
> >  	bdi->max_ratio = 100;
> >  	bdi->max_prop_frac = FPROP_FRAC_BASE;
> > +	bdi->dirty_background_centisecs = 0;
> >  	spin_lock_init(&bdi->wb_lock);
> >  	INIT_LIST_HEAD(&bdi->bdi_list);
> >  	INIT_LIST_HEAD(&bdi->work_list);
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 5ad5ce2..8c1530d 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1403,7 +1403,8 @@ pause:
> >  	if (laptop_mode)
> >  		return;
> >  
> > -	if (nr_reclaimable > background_thresh)
> > +	if (nr_reclaimable > background_thresh ||
> > +			over_dirty_bground_time(bdi))
> >  		bdi_start_background_writeback(bdi);
> >  }
> >  
> > -- 
> > 1.7.9.5
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-09-20  8:44 ` Fengguang Wu
  2012-09-24 22:23   ` Jan Kara
@ 2012-09-25  1:36   ` Dave Chinner
  2012-09-25  6:46     ` Namjae Jeon
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2012-09-25  1:36 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Namjae Jeon, Jan Kara, linux-kernel, Namjae Jeon, Vivek Trivedi,
	Linux Memory Management List, linux-fsdevel

On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote:
> [ CC FS and MM lists ]
> 
> Patch looks good to me, however we need to be careful because it's
> introducing a new interface. So it's desirable to get some acks from
> the FS/MM developers.
> 
> Thanks,
> Fengguang
> 
> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> > From: Namjae Jeon <namjae.jeon@samsung.com>
> > 
> > This patch is based on suggestion by Wu Fengguang:
> > https://lkml.org/lkml/2011/8/19/19
> > 
> > kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> > ratio. It also maintains per task dirty rate limit to keep balance of
> > dirty pages at any given instance by doing bdi bandwidth estimation.
> > 
> > Kernel also has max_ratio/min_ratio tunables to specify percentage of
> > writecache to control per bdi dirty limits and task throttling.
> > 
> > However, there might be a usecase where user wants a per bdi writeback tuning
> > parameter to flush dirty data once per bdi dirty data reach a threshold
> > especially at NFS server.
> > 
> > dirty_background_centisecs provides an interface where user can tune
> > background writeback start threshold using
> > /sys/block/sda/bdi/dirty_background_centisecs
> > 
> > dirty_background_centisecs is used alongwith average bdi write bandwidth
> > estimation to start background writeback.
> > 
> > One of the use case to demonstrate the patch functionality can be
> > on NFS setup:-
> > We have a NFS setup with ethernet line of 100Mbps, while the USB
> > disk is attached to server, which has a local speed of 25MBps. Server
> > and client both are arm target boards.
> > 
> > Now if we perform a write operation over NFS (client to server), as
> > per the network speed, data can travel at max speed of 100Mbps. But
> > if we check the default write speed of USB hdd over NFS it comes
> > around to 8MB/sec, far below the speed of network.
> > 
> > Reason being is as per the NFS logic, during write operation, initially
> > pages are dirtied on NFS client side, then after reaching the dirty
> > threshold/writeback limit (or in case of sync) data is actually sent
> > to NFS server (so now again pages are dirtied on server side). This
> > will be done in COMMIT call from client to server i.e if 100MB of data
> > is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9 seconds.
> > 
> > After the data is received, now it will take approx 100/25 ~4 Seconds to
> > write the data to USB Hdd on server side. Hence making the overall time
> > to write this much of data ~12 seconds, which in practically comes out to
> > be near 7 to 8MB/second. After this a COMMIT response will be sent to NFS
> > client.
> > 
> > However we may improve this write performace by making the use of NFS
> > server idle time i.e while data is being received from the client,
> > simultaneously initiate the writeback thread on server side. So instead
> > of waiting for the complete data to come and then start the writeback,
> > we can work in parallel while the network is still busy in receiving the
> > data. Hence in this way overall performace will be improved.
> > 
> > If we tune dirty_background_centisecs, we can see there
> > is increase in the performace and it comes out to be ~ 11MB/seconds.
> > Results are:-
> > 
> > Write test(create a 1 GB file) result at 'NFS client' after changing 
> > /sys/block/sda/bdi/dirty_background_centisecs 
> > on  *** NFS Server only - not on NFS Client ****

What is the configuration of the client and server? How much RAM,
what their dirty_* parameters are set to, network speed, server disk
speed for local sequential IO, etc?

> > ---------------------------------------------------------------------
> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
> > ---------------------------------------------------------------------
> > |          | default = 0 | 300 centisec| 200 centisec| 100 centisec |
> > ---------------------------------------------------------------------
> > |RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
> > ---------------------------------------------------------------------
> > |10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
> > | 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
> > |  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
> > |  262144  |  8.16MB/sec |  8.51MB/sec |  9.52MB/sec |  10.62MB/sec |
> > |  131072  |  8.48MB/sec |  8.81MB/sec |  9.42MB/sec |  10.55MB/sec |
> > |   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
> > |   32768  |  8.65MB/sec |  9.00MB/sec |  9.57MB/sec |  10.54MB/sec |
> > |   16384  |  8.27MB/sec |  8.80MB/sec |  9.39MB/sec |  10.43MB/sec |
> > |    8192  |  8.52MB/sec |  8.70MB/sec |  9.40MB/sec |  10.50MB/sec |
> > |    4096  |  8.20MB/sec |  8.63MB/sec |  9.80MB/sec |  10.35MB/sec |
> > ---------------------------------------------------------------------

While this set of numbers looks good, it's very limited in scope.
I can't evaluate whether the change is worthwhile or not from this
test. If I were writing this patch, the questions I'd be seeking to
answer before proposing it for inclusion are as follows....

1. what's the comparison in performance to typical NFS
server writeback parameter tuning? i.e. dirty_background_ratio=5,
dirty_ratio=10, dirty_expire_centisecs=1000,
dirty_writeback_centisecs=1? i.e. does this change give any
benefit over the current common practice for configuring NFS
servers?

2. what happens when you have 10 clients all writing to the server
at once? Or a 100? NFS servers rarely have a single writer to a
single file at a time, so what impact does this change have on
multiple concurrent file write performance from multiple clients?

3. Following on from the multiple client test, what difference does it
make to file fragmentation rates? Writing more frequently means
smaller allocations and writes, and that tends to lead to higher
fragmentation rates, especially when multiple files are being
written concurrently. Higher fragmentation also means lower
performance over time as fragmentation accelerates filesystem aging
effects on performance.  IOWs, it may be faster when new, but it
will be slower 3 months down the track and that's a bad tradeoff to
make.

4. What happens for higher bandwidth network links? e.g. gigE or
10gigE? Are the improvements still there? Or does it cause
regressions at higher speeds? I'm especially interested in what
happens to multiple writers at higher network speeds, because that's
a key performance metric used to measure enterprise level NFS
servers.

5. Are the improvements consistent across different filesystem
types?  We've had writeback changes in the past cause improvements
on one filesystem but significant regressions on others.  I'd
suggest that you need to present results for ext4, XFS and btrfs so
that we have a decent idea of what we can expect from the change to
the generic code.

Yeah, I'm asking a lot of questions. That's because the generic
writeback code is extremely important to performance and the impact
of a change cannot be evaluated from a single test.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-09-25  1:36   ` Dave Chinner
@ 2012-09-25  6:46     ` Namjae Jeon
  2012-09-25  6:54       ` Namjae Jeon
  2012-10-19  7:51       ` Namjae Jeon
  0 siblings, 2 replies; 16+ messages in thread
From: Namjae Jeon @ 2012-09-25  6:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fengguang Wu, Jan Kara, linux-kernel, Namjae Jeon, Vivek Trivedi,
	Linux Memory Management List, linux-fsdevel

2012/9/25, Dave Chinner <david@fromorbit.com>:
> On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote:
>> [ CC FS and MM lists ]
>>
>> Patch looks good to me, however we need to be careful because it's
>> introducing a new interface. So it's desirable to get some acks from
>> the FS/MM developers.
>>
>> Thanks,
>> Fengguang
>>
>> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
>> > From: Namjae Jeon <namjae.jeon@samsung.com>
>> >
>> > This patch is based on suggestion by Wu Fengguang:
>> > https://lkml.org/lkml/2011/8/19/19
>> >
>> > kernel has mechanism to do writeback as per dirty_ratio and
>> > dirty_background
>> > ratio. It also maintains per task dirty rate limit to keep balance of
>> > dirty pages at any given instance by doing bdi bandwidth estimation.
>> >
>> > Kernel also has max_ratio/min_ratio tunables to specify percentage of
>> > writecache to control per bdi dirty limits and task throttling.
>> >
>> > However, there might be a usecase where user wants a per bdi writeback
>> > tuning
>> > parameter to flush dirty data once per bdi dirty data reach a threshold
>> > especially at NFS server.
>> >
>> > dirty_background_centisecs provides an interface where user can tune
>> > background writeback start threshold using
>> > /sys/block/sda/bdi/dirty_background_centisecs
>> >
>> > dirty_background_centisecs is used alongwith average bdi write
>> > bandwidth
>> > estimation to start background writeback.
>> >
>> > One of the use case to demonstrate the patch functionality can be
>> > on NFS setup:-
>> > We have a NFS setup with ethernet line of 100Mbps, while the USB
>> > disk is attached to server, which has a local speed of 25MBps. Server
>> > and client both are arm target boards.
>> >
>> > Now if we perform a write operation over NFS (client to server), as
>> > per the network speed, data can travel at max speed of 100Mbps. But
>> > if we check the default write speed of USB hdd over NFS it comes
>> > around to 8MB/sec, far below the speed of network.
>> >
>> > Reason being is as per the NFS logic, during write operation, initially
>> > pages are dirtied on NFS client side, then after reaching the dirty
>> > threshold/writeback limit (or in case of sync) data is actually sent
>> > to NFS server (so now again pages are dirtied on server side). This
>> > will be done in COMMIT call from client to server i.e if 100MB of data
>> > is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9
>> > seconds.
>> >
>> > After the data is received, now it will take approx 100/25 ~4 Seconds
>> > to
>> > write the data to USB Hdd on server side. Hence making the overall time
>> > to write this much of data ~12 seconds, which in practically comes out
>> > to
>> > be near 7 to 8MB/second. After this a COMMIT response will be sent to
>> > NFS
>> > client.
>> >
>> > However we may improve this write performace by making the use of NFS
>> > server idle time i.e while data is being received from the client,
>> > simultaneously initiate the writeback thread on server side. So instead
>> > of waiting for the complete data to come and then start the writeback,
>> > we can work in parallel while the network is still busy in receiving
>> > the
>> > data. Hence in this way overall performace will be improved.
>> >
>> > If we tune dirty_background_centisecs, we can see there
>> > is increase in the performace and it comes out to be ~ 11MB/seconds.
>> > Results are:-
>> >
>> > Write test(create a 1 GB file) result at 'NFS client' after changing
>> > /sys/block/sda/bdi/dirty_background_centisecs
>> > on  *** NFS Server only - not on NFS Client ****
>

Hi. Dave.

> What is the configuration of the client and server? How much RAM,
> what their dirty_* parameters are set to, network speed, server disk
> speed for local sequential IO, etc?
these results are on ARM, 512MB RAM and XFS over NFS with default
writeback settings (only our writeback setting - dirty_background_centisecs -
changed at nfs server only). Network speed is ~100MB/sec and
local disk speed is ~25MB/sec.

>
>> > ---------------------------------------------------------------------
>> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
>> > ---------------------------------------------------------------------
>> > |          | default = 0 | 300 centisec| 200 centisec| 100 centisec |
>> > ---------------------------------------------------------------------
>> > |RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
>> > ---------------------------------------------------------------------
>> > |10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
>> > | 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
>> > |  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
>> > |  262144  |  8.16MB/sec |  8.51MB/sec |  9.52MB/sec |  10.62MB/sec |
>> > |  131072  |  8.48MB/sec |  8.81MB/sec |  9.42MB/sec |  10.55MB/sec |
>> > |   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
>> > |   32768  |  8.65MB/sec |  9.00MB/sec |  9.57MB/sec |  10.54MB/sec |
>> > |   16384  |  8.27MB/sec |  8.80MB/sec |  9.39MB/sec |  10.43MB/sec |
>> > |    8192  |  8.52MB/sec |  8.70MB/sec |  9.40MB/sec |  10.50MB/sec |
>> > |    4096  |  8.20MB/sec |  8.63MB/sec |  9.80MB/sec |  10.35MB/sec |
>> > ---------------------------------------------------------------------
>
> While this set of numbers looks good, it's a very limited in scope.
> I can't evaluate whether the change is worthwhile or not from this
> test. If I was writing this patch, the questions I'd be seeking to
> answer before proposing it for inclusion are as follows....
>
> 1. what's the comparison in performance to typical NFS
> server writeback parameter tuning? i.e. dirty_background_ratio=5,
> dirty_ratio=10, dirty_expire_centiseconds=1000,
> dirty_writeback_centisecs=1? i.e. does this give change give any
> benefit over the current common practice for configuring NFS
> servers?
>
> 2. what happens when you have 10 clients all writing to the server
> at once? Or a 100? NFS servers rarely have a single writer to a
> single file at a time, so what impact does this change have on
> multiple concurrent file write performance from multiple clients?
>
> 3. Following on from the multiple client test, what difference does it
> make to file fragmentation rates? Writing more frequently means
> smaller allocations and writes, and that tends to lead to higher
> fragmentation rates, especially when multiple files are being
> written concurrently. Higher fragmentation also means lower
> performance over time as fragmentation accelerates filesystem aging
> effects on performance.  IOWs, it may be faster when new, but it
> will be slower 3 months down the track and that's a bad tradeoff to
> make.
>
> 4. What happens for higher bandwidth network links? e.g. gigE or
> 10gigE? Are the improvements still there? Or does it cause
> regressions at higher speeds? I'm especially interested in what
> happens to multiple writers at higher network speeds, because that's
> a key performance metric used to measure enterprise level NFS
> servers.
>
> 5. Are the improvements consistent across different filesystem
> types?  We've had writeback changes in the past cause improvements
> on one filesystem but significant regressions on others.  I'd
> suggest that you need to present results for ext4, XFS and btrfs so
> that we have a decent idea of what we can expect from the change to
> the generic code.
>
> Yeah, I'm asking a lot of questions. That's because the generic
> writeback code is extremely important to performance and the impact
> of a change cannot be evaluated from a single test.
Yes, I agree.
I will share the patch's behavior on gigabit Ethernet, with different
filesystems (e.g. ext4, XFS and btrfs), and with a multiple-NFS-client setup.

Thanks.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-09-24 22:23   ` Jan Kara
@ 2012-09-25  6:50     ` Namjae Jeon
  2012-09-26 16:56     ` Fengguang Wu
  1 sibling, 0 replies; 16+ messages in thread
From: Namjae Jeon @ 2012-09-25  6:50 UTC (permalink / raw)
  To: Jan Kara, Fengguang Wu
  Cc: Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi,
	Linux Memory Management List, linux-fsdevel

2012/9/25, Jan Kara <jack@suse.cz>:
> On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
>> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
>> > From: Namjae Jeon <namjae.jeon@samsung.com>
>> >
>> > This patch is based on suggestion by Wu Fengguang:
>> > https://lkml.org/lkml/2011/8/19/19
>> >
>> > kernel has mechanism to do writeback as per dirty_ratio and
>> > dirty_background
>> > ratio. It also maintains per task dirty rate limit to keep balance of
>> > dirty pages at any given instance by doing bdi bandwidth estimation.
>> >
>> > Kernel also has max_ratio/min_ratio tunables to specify percentage of
>> > writecache to control per bdi dirty limits and task throttling.
>> >
>> > However, there might be a usecase where user wants a per bdi writeback
>> > tuning
>> > parameter to flush dirty data once per bdi dirty data reach a threshold
>> > especially at NFS server.
>> >
>> > dirty_background_centisecs provides an interface where user can tune
>> > background writeback start threshold using
>> > /sys/block/sda/bdi/dirty_background_centisecs
>> >
>> > dirty_background_centisecs is used alongwith average bdi write
>> > bandwidth
>> > estimation to start background writeback.
>   The functionality you describe, i.e. start flushing bdi when there's
> reasonable amount of dirty data on it, looks sensible and useful. However
> I'm not so sure whether the interface you propose is the right one.
> Traditionally, we allow user to set amount of dirty data (either in bytes
> or percentage of memory) when background writeback should start. You
> propose setting the amount of data in centisecs-to-write. Why that
> difference? Also this interface ties our throughput estimation code (which
> is an implementation detail of current dirty throttling) with the userspace
> API. So we'd have to maintain the estimation code forever, possibly also
> face problems when we change the estimation code (and thus estimates in
> some cases) and users will complain that the values they set originally no
> longer work as they used to.
>
> Also, as with each knob, there's a problem how to properly set its value?
> Most admins won't know about the knob and so won't touch it. Others might
> know about the knob but will have hard time figuring out what value should
> they set. So if there's a new knob, it should have a sensible initial
> value. And since this feature looks like a useful one, it shouldn't be
> zero.
>
> So my personal preference would be to have bdi->dirty_background_ratio and
> bdi->dirty_background_bytes and start background writeback whenever
> one of global background limit and per-bdi background limit is exceeded. I
> think this interface will do the job as well and it's easier to maintain in
> future.
Hi Jan.
Thanks for the review and your opinion.

Hi Wu.
How about adding the per-bdi bdi->dirty_background_ratio and
bdi->dirty_background_bytes interfaces, as suggested by Jan?

Thanks.
>
> 								Honza

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-09-25  6:46     ` Namjae Jeon
@ 2012-09-25  6:54       ` Namjae Jeon
  2012-10-19  7:51       ` Namjae Jeon
  1 sibling, 0 replies; 16+ messages in thread
From: Namjae Jeon @ 2012-09-25  6:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fengguang Wu, Jan Kara, linux-kernel, Namjae Jeon, Vivek Trivedi,
	Linux Memory Management List, linux-fsdevel

2012/9/25, Namjae Jeon <linkinjeon@gmail.com>:
> 2012/9/25, Dave Chinner <david@fromorbit.com>:
>> On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote:
>>> [ CC FS and MM lists ]
>>>
>>> Patch looks good to me, however we need to be careful because it's
>>> introducing a new interface. So it's desirable to get some acks from
>>> the FS/MM developers.
>>>
>>> Thanks,
>>> Fengguang
>>>
>>> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
>>> > From: Namjae Jeon <namjae.jeon@samsung.com>
>>> >
>>> > This patch is based on suggestion by Wu Fengguang:
>>> > https://lkml.org/lkml/2011/8/19/19
>>> >
>>> > kernel has mechanism to do writeback as per dirty_ratio and
>>> > dirty_background
>>> > ratio. It also maintains per task dirty rate limit to keep balance of
>>> > dirty pages at any given instance by doing bdi bandwidth estimation.
>>> >
>>> > Kernel also has max_ratio/min_ratio tunables to specify percentage of
>>> > writecache to control per bdi dirty limits and task throttling.
>>> >
>>> > However, there might be a usecase where user wants a per bdi writeback
>>> > tuning
>>> > parameter to flush dirty data once per bdi dirty data reach a
>>> > threshold
>>> > especially at NFS server.
>>> >
>>> > dirty_background_centisecs provides an interface where user can tune
>>> > background writeback start threshold using
>>> > /sys/block/sda/bdi/dirty_background_centisecs
>>> >
>>> > dirty_background_centisecs is used alongwith average bdi write
>>> > bandwidth
>>> > estimation to start background writeback.
>>> >
>>> > One of the use case to demonstrate the patch functionality can be
>>> > on NFS setup:-
>>> > We have a NFS setup with ethernet line of 100Mbps, while the USB
>>> > disk is attached to server, which has a local speed of 25MBps. Server
>>> > and client both are arm target boards.
>>> >
>>> > Now if we perform a write operation over NFS (client to server), as
>>> > per the network speed, data can travel at max speed of 100Mbps. But
>>> > if we check the default write speed of USB hdd over NFS it comes
>>> > around to 8MB/sec, far below the speed of network.
>>> >
>>> > Reason being is as per the NFS logic, during write operation,
>>> > initially
>>> > pages are dirtied on NFS client side, then after reaching the dirty
>>> > threshold/writeback limit (or in case of sync) data is actually sent
>>> > to NFS server (so now again pages are dirtied on server side). This
>>> > will be done in COMMIT call from client to server i.e if 100MB of data
>>> > is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9
>>> > seconds.
>>> >
>>> > After the data is received, now it will take approx 100/25 ~4 Seconds
>>> > to
>>> > write the data to USB Hdd on server side. Hence making the overall
>>> > time
>>> > to write this much of data ~12 seconds, which in practically comes out
>>> > to
>>> > be near 7 to 8MB/second. After this a COMMIT response will be sent to
>>> > NFS
>>> > client.
>>> >
>>> > However we may improve this write performace by making the use of NFS
>>> > server idle time i.e while data is being received from the client,
>>> > simultaneously initiate the writeback thread on server side. So
>>> > instead
>>> > of waiting for the complete data to come and then start the writeback,
>>> > we can work in parallel while the network is still busy in receiving
>>> > the
>>> > data. Hence in this way overall performace will be improved.
>>> >
>>> > If we tune dirty_background_centisecs, we can see there
>>> > is increase in the performace and it comes out to be ~ 11MB/seconds.
>>> > Results are:-
>>> >
>>> > Write test(create a 1 GB file) result at 'NFS client' after changing
>>> > /sys/block/sda/bdi/dirty_background_centisecs
>>> > on  *** NFS Server only - not on NFS Client ****
>>
>
> Hi. Dave.
>
>> What is the configuration of the client and server? How much RAM,
>> what their dirty_* parameters are set to, network speed, server disk
>> speed for local sequential IO, etc?
> these results are on ARM, 512MB RAM and XFS over NFS with default
> writeback settings (only our writeback setting - dirty_background_centisecs
> changed at NFS server only). Network speed is ~100MB/sec and
Sorry, there is a typo :) The network speed should read ~100Mb/sec, not ~100MB/sec.
> local disk speed is ~25MB/sec.
>
>>
>>> > ---------------------------------------------------------------------
>>> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
>>> > ---------------------------------------------------------------------
>>> > |          | default = 0 | 300 centisec| 200 centisec| 100 centisec |
>>> > ---------------------------------------------------------------------
>>> > |RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
>>> > ---------------------------------------------------------------------
>>> > |10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
>>> > | 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
>>> > |  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
>>> > |  262144  |  8.16MB/sec |  8.51MB/sec |  9.52MB/sec |  10.62MB/sec |
>>> > |  131072  |  8.48MB/sec |  8.81MB/sec |  9.42MB/sec |  10.55MB/sec |
>>> > |   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
>>> > |   32768  |  8.65MB/sec |  9.00MB/sec |  9.57MB/sec |  10.54MB/sec |
>>> > |   16384  |  8.27MB/sec |  8.80MB/sec |  9.39MB/sec |  10.43MB/sec |
>>> > |    8192  |  8.52MB/sec |  8.70MB/sec |  9.40MB/sec |  10.50MB/sec |
>>> > |    4096  |  8.20MB/sec |  8.63MB/sec |  9.80MB/sec |  10.35MB/sec |
>>> > ---------------------------------------------------------------------
>>
>> While this set of numbers looks good, it's a very limited in scope.
>> I can't evaluate whether the change is worthwhile or not from this
>> test. If I was writing this patch, the questions I'd be seeking to
>> answer before proposing it for inclusion are as follows....
>>
>> 1. what's the comparison in performance to typical NFS
>> server writeback parameter tuning? i.e. dirty_background_ratio=5,
>> dirty_ratio=10, dirty_expire_centiseconds=1000,
>> dirty_writeback_centisecs=1? i.e. does this give change give any
>> benefit over the current common practice for configuring NFS
>> servers?
>>
>> 2. what happens when you have 10 clients all writing to the server
>> at once? Or a 100? NFS servers rarely have a single writer to a
>> single file at a time, so what impact does this change have on
>> multiple concurrent file write performance from multiple clients?
>>
>> 3. Following on from the multiple client test, what difference does it
>> make to file fragmentation rates? Writing more frequently means
>> smaller allocations and writes, and that tends to lead to higher
>> fragmentation rates, especially when multiple files are being
>> written concurrently. Higher fragmentation also means lower
>> performance over time as fragmentation accelerates filesystem aging
>> effects on performance.  IOWs, it may be faster when new, but it
>> will be slower 3 months down the track and that's a bad tradeoff to
>> make.
>>
>> 4. What happens for higher bandwidth network links? e.g. gigE or
>> 10gigE? Are the improvements still there? Or does it cause
>> regressions at higher speeds? I'm especially interested in what
>> happens to multiple writers at higher network speeds, because that's
>> a key performance metric used to measure enterprise level NFS
>> servers.
>>
>> 5. Are the improvements consistent across different filesystem
>> types?  We've had writeback changes in the past cause improvements
>> on one filesystem but significant regressions on others.  I'd
>> suggest that you need to present results for ext4, XFS and btrfs so
>> that we have a decent idea of what we can expect from the change to
>> the generic code.
>>
>> Yeah, I'm asking a lot of questions. That's because the generic
>> writeback code is extremely important to performance and the impact
>> of a change cannot be evaluated from a single test.
> Yes, I agree.
> I will share patch behavior in gigabit Ethernet, different
> filesystems(e.g. ext4, xfs and btrfs) and multiple NFS clients setup.
>
> Thanks.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
>>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-09-24 22:23   ` Jan Kara
  2012-09-25  6:50     ` Namjae Jeon
@ 2012-09-26 16:56     ` Fengguang Wu
  2012-09-26 20:22       ` Jan Kara
  1 sibling, 1 reply; 16+ messages in thread
From: Fengguang Wu @ 2012-09-26 16:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Namjae Jeon, Dave Chinner, linux-kernel, Namjae Jeon,
	Vivek Trivedi, Linux Memory Management List, linux-fsdevel

On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
> On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
> > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> > > From: Namjae Jeon <namjae.jeon@samsung.com>
> > > 
> > > This patch is based on suggestion by Wu Fengguang:
> > > https://lkml.org/lkml/2011/8/19/19
> > > 
> > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> > > ratio. It also maintains per task dirty rate limit to keep balance of
> > > dirty pages at any given instance by doing bdi bandwidth estimation.
> > > 
> > > Kernel also has max_ratio/min_ratio tunables to specify percentage of
> > > writecache to control per bdi dirty limits and task throttling.
> > > 
> > > However, there might be a usecase where user wants a per bdi writeback tuning
> > > parameter to flush dirty data once per bdi dirty data reach a threshold
> > > especially at NFS server.
> > > 
> > > dirty_background_centisecs provides an interface where user can tune
> > > background writeback start threshold using
> > > /sys/block/sda/bdi/dirty_background_centisecs
> > > 
> > > dirty_background_centisecs is used alongwith average bdi write bandwidth
> > > estimation to start background writeback.
>   The functionality you describe, i.e. start flushing bdi when there's
> reasonable amount of dirty data on it, looks sensible and useful. However
> I'm not so sure whether the interface you propose is the right one.
> Traditionally, we allow user to set amount of dirty data (either in bytes
> or percentage of memory) when background writeback should start. You
> propose setting the amount of data in centisecs-to-write. Why that
> difference? Also this interface ties our throughput estimation code (which
> is an implementation detail of current dirty throttling) with the userspace
> API. So we'd have to maintain the estimation code forever, possibly also
> face problems when we change the estimation code (and thus estimates in
> some cases) and users will complain that the values they set originally no
> longer work as they used to.

Yes, that bandwidth estimation is not all that reliable (and in theory
cannot be made reliable), which may come as a surprise to the user and
makes the interface flaky.

> Also, as with each knob, there's a problem how to properly set its value?
> Most admins won't know about the knob and so won't touch it. Others might
> know about the knob but will have hard time figuring out what value should
> they set. So if there's a new knob, it should have a sensible initial
> value. And since this feature looks like a useful one, it shouldn't be
> zero.

Agreed in principle. There seems to be no reasonable default for the
centisecs-to-write interface, mainly due to its inaccurate nature;
in particular the initial value may be wildly wrong on fresh system
bootup. This is also true for your proposed interfaces, see below.

> So my personal preference would be to have bdi->dirty_background_ratio and
> bdi->dirty_background_bytes and start background writeback whenever
> one of global background limit and per-bdi background limit is exceeded. I
> think this interface will do the job as well and it's easier to maintain in
> future.

bdi->dirty_background_ratio, if I understand its semantics right, is
unfortunately flaky for the same reason as centisecs-to-write,
because it relies on the (implicit estimation of) writeout
proportions. The writeout proportion for each bdi starts at 0,
which is even worse than the 100MB/s initial value for
bdi->write_bandwidth and will trigger background writeback on the
first write.

bdi->dirty_background_bytes is, however, reliable, and gives users
total control. If we export this interface alone, I'd imagine users
who want to control centisecs-to-write could run a simple script to
periodically get the write bandwidth value out of the existing bdi
interface and echo it into bdi->dirty_background_bytes. That makes
for simple yet good enough centisecs-to-write control.
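
Something along these lines - just a rough sketch, not a tested tool: it
assumes the per-bdi dirty_background_bytes knob discussed in this thread,
that debugfs is mounted at /sys/kernel/debug, and that the bdi stats file
there exposes a BdiWriteBandwidth line in kB/s; the device and bdi ids
below are examples only:

#!/bin/sh
# Sketch: keep roughly CENTISECS worth of dirty data per bdi by
# converting the kernel's write bandwidth estimate into bytes and
# feeding it to the proposed per-bdi dirty_background_bytes knob.
DEV=sdb          # block device exported over NFS (example)
BDI=8:16         # matching bdi id under debugfs (example)
CENTISECS=100    # target: about one second worth of dirty data

while sleep 10; do
        # BdiWriteBandwidth is assumed to be reported in kB/s
        bw_kbps=$(awk '/BdiWriteBandwidth/ {print $2}' \
                /sys/kernel/debug/bdi/$BDI/stats)
        [ -n "$bw_kbps" ] || continue
        echo $(( bw_kbps * 1024 * CENTISECS / 100 )) \
                > /sys/block/$DEV/bdi/dirty_background_bytes
done

A smarter script could smooth the samples, but even something this dumb
should keep the threshold tracking the device speed reasonably well.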

So what do you think about exporting a really dumb
bdi->dirty_background_bytes, which will effectively give smart users
the freedom to do smart control over the per-bdi background writeback
threshold? Users are offered the freedom to do their own bandwidth
estimation and choose not to rely on the kernel estimate, which will
free us from the burden of maintaining a flaky interface as well. :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-09-26 16:56     ` Fengguang Wu
@ 2012-09-26 20:22       ` Jan Kara
  2012-09-27  6:00         ` Namjae Jeon
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2012-09-26 20:22 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, Namjae Jeon, Dave Chinner, linux-kernel, Namjae Jeon,
	Vivek Trivedi, Linux Memory Management List, linux-fsdevel

On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> > > > From: Namjae Jeon <namjae.jeon@samsung.com>
> > > > 
> > > > This patch is based on suggestion by Wu Fengguang:
> > > > https://lkml.org/lkml/2011/8/19/19
> > > > 
> > > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> > > > ratio. It also maintains per task dirty rate limit to keep balance of
> > > > dirty pages at any given instance by doing bdi bandwidth estimation.
> > > > 
> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage of
> > > > writecache to control per bdi dirty limits and task throttling.
> > > > 
> > > > However, there might be a usecase where user wants a per bdi writeback tuning
> > > > parameter to flush dirty data once per bdi dirty data reach a threshold
> > > > especially at NFS server.
> > > > 
> > > > dirty_background_centisecs provides an interface where user can tune
> > > > background writeback start threshold using
> > > > /sys/block/sda/bdi/dirty_background_centisecs
> > > > 
> > > > dirty_background_centisecs is used alongwith average bdi write bandwidth
> > > > estimation to start background writeback.
> >   The functionality you describe, i.e. start flushing bdi when there's
> > reasonable amount of dirty data on it, looks sensible and useful. However
> > I'm not so sure whether the interface you propose is the right one.
> > Traditionally, we allow user to set amount of dirty data (either in bytes
> > or percentage of memory) when background writeback should start. You
> > propose setting the amount of data in centisecs-to-write. Why that
> > difference? Also this interface ties our throughput estimation code (which
> > is an implementation detail of current dirty throttling) with the userspace
> > API. So we'd have to maintain the estimation code forever, possibly also
> > face problems when we change the estimation code (and thus estimates in
> > some cases) and users will complain that the values they set originally no
> > longer work as they used to.
> 
> Yes, that bandwidth estimation is not all that (and in theory cannot
> be made) reliable which may be a surprise to the user. Which make the
> interface flaky.
> 
> > Also, as with each knob, there's a problem how to properly set its value?
> > Most admins won't know about the knob and so won't touch it. Others might
> > know about the knob but will have hard time figuring out what value should
> > they set. So if there's a new knob, it should have a sensible initial
> > value. And since this feature looks like a useful one, it shouldn't be
> > zero.
> 
> Agreed in principle. There seems be no reasonable defaults for the
> centisecs-to-write interface, mainly due to its inaccurate nature,
> especially the initial value may be wildly wrong on fresh system
> bootup. This is also true for your proposed interfaces, see below.
> 
> > So my personal preference would be to have bdi->dirty_background_ratio and
> > bdi->dirty_background_bytes and start background writeback whenever
> > one of global background limit and per-bdi background limit is exceeded. I
> > think this interface will do the job as well and it's easier to maintain in
> > future.
> 
> bdi->dirty_background_ratio, if I understand its semantics right, is
> unfortunately flaky in the same principle as centisecs-to-write,
> because it relies on the (implicitly estimation of) writeout
> proportions. The writeout proportions for each bdi starts with 0,
> which is even worse than the 100MB/s initial value for
> bdi->write_bandwidth and will trigger background writeback on the
> first write.
  Well, I meant bdi->dirty_background_ratio wouldn't use writeout proportion
estimates at all. The limit would be
  dirtiable_memory * bdi->dirty_background_ratio.

After all, we want to start writeout to a bdi when we have enough pages to
reasonably load the device for a while, which has nothing to do with how
much is written to this device as compared to other devices.

OTOH I'm not particularly attached to this interface, especially since on a
lot of today's machines 1% is rather big, so people might often end up
using dirty_background_bytes anyway.

> bdi->dirty_background_bytes is, however, reliable, and gives users
> total control. If we export this interface alone, I'd imagine users
> who want to control centisecs-to-write could run a simple script to
> periodically get the write bandwith value out of the existing bdi
> interface and echo it into bdi->dirty_background_bytes. Which makes
> simple yet good enough centisecs-to-write controlling.
> 
> So what do you think about exporting a really dumb
> bdi->dirty_background_bytes, which will effectively give smart users
> the freedom to do smart control over per-bdi background writeback
> threshold? The users are offered the freedom to do his own bandwidth
> estimation and choose not to rely on the kernel estimation, which will
> free us from the burden of maintaining a flaky interface as well. :)
  That's fine with me. It would just be nice if we gave
bdi->dirty_background_bytes some useful initial value. Maybe something like
dirtiable_memory * dirty_background_ratio?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-09-26 20:22       ` Jan Kara
@ 2012-09-27  6:00         ` Namjae Jeon
  2012-09-27 11:52           ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Namjae Jeon @ 2012-09-27  6:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fengguang Wu, Dave Chinner, linux-kernel, Namjae Jeon,
	Vivek Trivedi, Linux Memory Management List, linux-fsdevel

2012/9/27, Jan Kara <jack@suse.cz>:
> On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
>> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
>> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
>> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
>> > > > From: Namjae Jeon <namjae.jeon@samsung.com>
>> > > >
>> > > > This patch is based on suggestion by Wu Fengguang:
>> > > > https://lkml.org/lkml/2011/8/19/19
>> > > >
>> > > > kernel has mechanism to do writeback as per dirty_ratio and
>> > > > dirty_background
>> > > > ratio. It also maintains per task dirty rate limit to keep balance
>> > > > of
>> > > > dirty pages at any given instance by doing bdi bandwidth
>> > > > estimation.
>> > > >
>> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage
>> > > > of
>> > > > writecache to control per bdi dirty limits and task throttling.
>> > > >
>> > > > However, there might be a usecase where user wants a per bdi
>> > > > writeback tuning
>> > > > parameter to flush dirty data once per bdi dirty data reach a
>> > > > threshold
>> > > > especially at NFS server.
>> > > >
>> > > > dirty_background_centisecs provides an interface where user can
>> > > > tune
>> > > > background writeback start threshold using
>> > > > /sys/block/sda/bdi/dirty_background_centisecs
>> > > >
>> > > > dirty_background_centisecs is used alongwith average bdi write
>> > > > bandwidth
>> > > > estimation to start background writeback.
>> >   The functionality you describe, i.e. start flushing bdi when there's
>> > reasonable amount of dirty data on it, looks sensible and useful.
>> > However
>> > I'm not so sure whether the interface you propose is the right one.
>> > Traditionally, we allow user to set amount of dirty data (either in
>> > bytes
>> > or percentage of memory) when background writeback should start. You
>> > propose setting the amount of data in centisecs-to-write. Why that
>> > difference? Also this interface ties our throughput estimation code
>> > (which
>> > is an implementation detail of current dirty throttling) with the
>> > userspace
>> > API. So we'd have to maintain the estimation code forever, possibly
>> > also
>> > face problems when we change the estimation code (and thus estimates in
>> > some cases) and users will complain that the values they set originally
>> > no
>> > longer work as they used to.
>>
>> Yes, that bandwidth estimation is not all that (and in theory cannot
>> be made) reliable which may be a surprise to the user. Which make the
>> interface flaky.
>>
>> > Also, as with each knob, there's a problem how to properly set its
>> > value?
>> > Most admins won't know about the knob and so won't touch it. Others
>> > might
>> > know about the knob but will have hard time figuring out what value
>> > should
>> > they set. So if there's a new knob, it should have a sensible initial
>> > value. And since this feature looks like a useful one, it shouldn't be
>> > zero.
>>
>> Agreed in principle. There seems be no reasonable defaults for the
>> centisecs-to-write interface, mainly due to its inaccurate nature,
>> especially the initial value may be wildly wrong on fresh system
>> bootup. This is also true for your proposed interfaces, see below.
>>
>> > So my personal preference would be to have bdi->dirty_background_ratio
>> > and
>> > bdi->dirty_background_bytes and start background writeback whenever
>> > one of global background limit and per-bdi background limit is exceeded.
>> > I
>> > think this interface will do the job as well and it's easier to maintain
>> > in
>> > future.
>>
>> bdi->dirty_background_ratio, if I understand its semantics right, is
>> unfortunately flaky in the same principle as centisecs-to-write,
>> because it relies on the (implicitly estimation of) writeout
>> proportions. The writeout proportions for each bdi starts with 0,
>> which is even worse than the 100MB/s initial value for
>> bdi->write_bandwidth and will trigger background writeback on the
>> first write.
>   Well, I meant bdi->dirty_backround_ratio wouldn't use writeout proportion
> estimates at all. Limit would be
>   dirtiable_memory * bdi->dirty_backround_ratio.
>
> After all we want to start writeout to bdi when we have enough pages to
> reasonably load the device for a while which has nothing to do with how
> much is written to this device as compared to other devices.
>
> OTOH I'm not particularly attached to this interface. Especially since on a
> lot of today's machines, 1% is rather big so people might often end up
> using dirty_background_bytes anyway.
>
>> bdi->dirty_background_bytes is, however, reliable, and gives users
>> total control. If we export this interface alone, I'd imagine users
>> who want to control centisecs-to-write could run a simple script to
>> periodically get the write bandwith value out of the existing bdi
>> interface and echo it into bdi->dirty_background_bytes. Which makes
>> simple yet good enough centisecs-to-write controlling.
>>
>> So what do you think about exporting a really dumb
>> bdi->dirty_background_bytes, which will effectively give smart users
>> the freedom to do smart control over per-bdi background writeback
>> threshold? The users are offered the freedom to do his own bandwidth
>> estimation and choose not to rely on the kernel estimation, which will
>> free us from the burden of maintaining a flaky interface as well. :)
>   That's fine with me. Just it would be nice if we gave
> bdi->dirty_background_bytes some useful initial value. Maybe like
> dirtiable_memory * dirty_background_ratio?
Hi Jan.
The global dirty_background_bytes default value is zero, which means
flushing is started based on dirty_background_ratio and dirtiable
memory.
Is it correct to set the per-bdi default dirty threshold
(bdi->dirty_background_bytes) equal to the global dirty threshold,
i.e. dirtiable_memory * dirty_background_ratio?
In my opinion, the default setting for bdi->dirty_background_bytes
should be zero to avoid any confusion and any change in the default
writeback behaviour.

Thanks.
>
> 								Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-09-27  6:00         ` Namjae Jeon
@ 2012-09-27 11:52           ` Jan Kara
  2012-09-28  1:59             ` Namjae Jeon
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2012-09-27 11:52 UTC (permalink / raw)
  To: Namjae Jeon
  Cc: Jan Kara, Fengguang Wu, Dave Chinner, linux-kernel, Namjae Jeon,
	Vivek Trivedi, Linux Memory Management List, linux-fsdevel

On Thu 27-09-12 15:00:18, Namjae Jeon wrote:
> 2012/9/27, Jan Kara <jack@suse.cz>:
> > On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
> >> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
> >> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
> >> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> >> > > > From: Namjae Jeon <namjae.jeon@samsung.com>
> >> > > >
> >> > > > This patch is based on suggestion by Wu Fengguang:
> >> > > > https://lkml.org/lkml/2011/8/19/19
> >> > > >
> >> > > > kernel has mechanism to do writeback as per dirty_ratio and
> >> > > > dirty_background
> >> > > > ratio. It also maintains per task dirty rate limit to keep balance
> >> > > > of
> >> > > > dirty pages at any given instance by doing bdi bandwidth
> >> > > > estimation.
> >> > > >
> >> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage
> >> > > > of
> >> > > > writecache to control per bdi dirty limits and task throttling.
> >> > > >
> >> > > > However, there might be a usecase where user wants a per bdi
> >> > > > writeback tuning
> >> > > > parameter to flush dirty data once per bdi dirty data reach a
> >> > > > threshold
> >> > > > especially at NFS server.
> >> > > >
> >> > > > dirty_background_centisecs provides an interface where user can
> >> > > > tune
> >> > > > background writeback start threshold using
> >> > > > /sys/block/sda/bdi/dirty_background_centisecs
> >> > > >
> >> > > > dirty_background_centisecs is used alongwith average bdi write
> >> > > > bandwidth
> >> > > > estimation to start background writeback.
> >> >   The functionality you describe, i.e. start flushing bdi when there's
> >> > reasonable amount of dirty data on it, looks sensible and useful.
> >> > However
> >> > I'm not so sure whether the interface you propose is the right one.
> >> > Traditionally, we allow user to set amount of dirty data (either in
> >> > bytes
> >> > or percentage of memory) when background writeback should start. You
> >> > propose setting the amount of data in centisecs-to-write. Why that
> >> > difference? Also this interface ties our throughput estimation code
> >> > (which
> >> > is an implementation detail of current dirty throttling) with the
> >> > userspace
> >> > API. So we'd have to maintain the estimation code forever, possibly
> >> > also
> >> > face problems when we change the estimation code (and thus estimates in
> >> > some cases) and users will complain that the values they set originally
> >> > no
> >> > longer work as they used to.
> >>
> >> Yes, that bandwidth estimation is not all that (and in theory cannot
> >> be made) reliable which may be a surprise to the user. Which make the
> >> interface flaky.
> >>
> >> > Also, as with each knob, there's a problem how to properly set its
> >> > value?
> >> > Most admins won't know about the knob and so won't touch it. Others
> >> > might
> >> > know about the knob but will have hard time figuring out what value
> >> > should
> >> > they set. So if there's a new knob, it should have a sensible initial
> >> > value. And since this feature looks like a useful one, it shouldn't be
> >> > zero.
> >>
> >> Agreed in principle. There seems be no reasonable defaults for the
> >> centisecs-to-write interface, mainly due to its inaccurate nature,
> >> especially the initial value may be wildly wrong on fresh system
> >> bootup. This is also true for your proposed interfaces, see below.
> >>
> >> > So my personal preference would be to have bdi->dirty_background_ratio
> >> > and
> >> > bdi->dirty_background_bytes and start background writeback whenever
> >> > one of global background limit and per-bdi background limit is exceeded.
> >> > I
> >> > think this interface will do the job as well and it's easier to maintain
> >> > in
> >> > future.
> >>
> >> bdi->dirty_background_ratio, if I understand its semantics right, is
> >> unfortunately flaky in the same principle as centisecs-to-write,
> >> because it relies on the (implicitly estimation of) writeout
> >> proportions. The writeout proportions for each bdi starts with 0,
> >> which is even worse than the 100MB/s initial value for
> >> bdi->write_bandwidth and will trigger background writeback on the
> >> first write.
> >   Well, I meant bdi->dirty_backround_ratio wouldn't use writeout proportion
> > estimates at all. Limit would be
> >   dirtiable_memory * bdi->dirty_backround_ratio.
> >
> > After all we want to start writeout to bdi when we have enough pages to
> > reasonably load the device for a while which has nothing to do with how
> > much is written to this device as compared to other devices.
> >
> > OTOH I'm not particularly attached to this interface. Especially since on a
> > lot of today's machines, 1% is rather big so people might often end up
> > using dirty_background_bytes anyway.
> >
> >> bdi->dirty_background_bytes is, however, reliable, and gives users
> >> total control. If we export this interface alone, I'd imagine users
> >> who want to control centisecs-to-write could run a simple script to
> >> periodically get the write bandwith value out of the existing bdi
> >> interface and echo it into bdi->dirty_background_bytes. Which makes
> >> simple yet good enough centisecs-to-write controlling.
> >>
> >> So what do you think about exporting a really dumb
> >> bdi->dirty_background_bytes, which will effectively give smart users
> >> the freedom to do smart control over per-bdi background writeback
> >> threshold? The users are offered the freedom to do his own bandwidth
> >> estimation and choose not to rely on the kernel estimation, which will
> >> free us from the burden of maintaining a flaky interface as well. :)
> >   That's fine with me. Just it would be nice if we gave
> > bdi->dirty_background_bytes some useful initial value. Maybe like
> > dirtiable_memory * dirty_background_ratio?
> Global dirty_background_bytes default value is zero that means
> flushing is started based on dirty_background_ratio and dirtiable
> memory.
> Is it correct to set per bdi default dirty threshold
> (bdi->dirty_background_bytes) equal to global dirty threshold  -
> dirtiable_memory * dirty_background_ratio ?
  Right, the default setting I proposed doesn't make a difference. And it's
not obvious how to create one which is more meaningful. Pity.

> In my opinion, default setting for per bdi-> dirty_background_bytes
> should be zero to avoid any confusion and any change in default
> writeback behaviour.
  OK, fine with me.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-09-27 11:52           ` Jan Kara
@ 2012-09-28  1:59             ` Namjae Jeon
  0 siblings, 0 replies; 16+ messages in thread
From: Namjae Jeon @ 2012-09-28  1:59 UTC (permalink / raw)
  To: Jan Kara, Fengguang Wu
  Cc: Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi,
	Linux Memory Management List, linux-fsdevel

2012/9/27, Jan Kara <jack@suse.cz>:
> On Thu 27-09-12 15:00:18, Namjae Jeon wrote:
>> 2012/9/27, Jan Kara <jack@suse.cz>:
>> > On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
>> >> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
>> >> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
>> >> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
>> >> > > > From: Namjae Jeon <namjae.jeon@samsung.com>
>> >> > > >
>> >> > > > This patch is based on suggestion by Wu Fengguang:
>> >> > > > https://lkml.org/lkml/2011/8/19/19
>> >> > > >
>> >> > > > kernel has mechanism to do writeback as per dirty_ratio and
>> >> > > > dirty_background
>> >> > > > ratio. It also maintains per task dirty rate limit to keep
>> >> > > > balance
>> >> > > > of
>> >> > > > dirty pages at any given instance by doing bdi bandwidth
>> >> > > > estimation.
>> >> > > >
>> >> > > > Kernel also has max_ratio/min_ratio tunables to specify
>> >> > > > percentage
>> >> > > > of
>> >> > > > writecache to control per bdi dirty limits and task throttling.
>> >> > > >
>> >> > > > However, there might be a usecase where user wants a per bdi
>> >> > > > writeback tuning
>> >> > > > parameter to flush dirty data once per bdi dirty data reach a
>> >> > > > threshold
>> >> > > > especially at NFS server.
>> >> > > >
>> >> > > > dirty_background_centisecs provides an interface where user can
>> >> > > > tune
>> >> > > > background writeback start threshold using
>> >> > > > /sys/block/sda/bdi/dirty_background_centisecs
>> >> > > >
>> >> > > > dirty_background_centisecs is used alongwith average bdi write
>> >> > > > bandwidth
>> >> > > > estimation to start background writeback.
>> >> >   The functionality you describe, i.e. start flushing bdi when
>> >> > there's
>> >> > reasonable amount of dirty data on it, looks sensible and useful.
>> >> > However
>> >> > I'm not so sure whether the interface you propose is the right one.
>> >> > Traditionally, we allow user to set amount of dirty data (either in
>> >> > bytes
>> >> > or percentage of memory) when background writeback should start. You
>> >> > propose setting the amount of data in centisecs-to-write. Why that
>> >> > difference? Also this interface ties our throughput estimation code
>> >> > (which
>> >> > is an implementation detail of current dirty throttling) with the
>> >> > userspace
>> >> > API. So we'd have to maintain the estimation code forever, possibly
>> >> > also
>> >> > face problems when we change the estimation code (and thus estimates
>> >> > in
>> >> > some cases) and users will complain that the values they set
>> >> > originally
>> >> > no
>> >> > longer work as they used to.
>> >>
>> >> Yes, that bandwidth estimation is not all that (and in theory cannot
>> >> be made) reliable which may be a surprise to the user. Which make the
>> >> interface flaky.
>> >>
>> >> > Also, as with each knob, there's a problem how to properly set its
>> >> > value?
>> >> > Most admins won't know about the knob and so won't touch it. Others
>> >> > might
>> >> > know about the knob but will have hard time figuring out what value
>> >> > should
>> >> > they set. So if there's a new knob, it should have a sensible
>> >> > initial
>> >> > value. And since this feature looks like a useful one, it shouldn't
>> >> > be
>> >> > zero.
>> >>
>> >> Agreed in principle. There seems be no reasonable defaults for the
>> >> centisecs-to-write interface, mainly due to its inaccurate nature,
>> >> especially the initial value may be wildly wrong on fresh system
>> >> bootup. This is also true for your proposed interfaces, see below.
>> >>
>> >> > So my personal preference would be to have
>> >> > bdi->dirty_background_ratio
>> >> > and
>> >> > bdi->dirty_background_bytes and start background writeback whenever
>> >> > one of global background limit and per-bdi background limit is
>> >> > exceeded.
>> >> > I
>> >> > think this interface will do the job as well and it's easier to
>> >> > maintain
>> >> > in
>> >> > future.
>> >>
>> >> bdi->dirty_background_ratio, if I understand its semantics right, is
>> >> unfortunately flaky in the same principle as centisecs-to-write,
>> >> because it relies on the (implicitly estimation of) writeout
>> >> proportions. The writeout proportions for each bdi starts with 0,
>> >> which is even worse than the 100MB/s initial value for
>> >> bdi->write_bandwidth and will trigger background writeback on the
>> >> first write.
>> >   Well, I meant bdi->dirty_backround_ratio wouldn't use writeout
>> > proportion
>> > estimates at all. Limit would be
>> >   dirtiable_memory * bdi->dirty_backround_ratio.
>> >
>> > After all we want to start writeout to bdi when we have enough pages to
>> > reasonably load the device for a while which has nothing to do with how
>> > much is written to this device as compared to other devices.
>> >
>> > OTOH I'm not particularly attached to this interface. Especially since
>> > on a
>> > lot of today's machines, 1% is rather big so people might often end up
>> > using dirty_background_bytes anyway.
>> >
>> >> bdi->dirty_background_bytes is, however, reliable, and gives users
>> >> total control. If we export this interface alone, I'd imagine users
>> >> who want to control centisecs-to-write could run a simple script to
>> >> periodically get the write bandwith value out of the existing bdi
>> >> interface and echo it into bdi->dirty_background_bytes. Which makes
>> >> simple yet good enough centisecs-to-write controlling.
>> >>
>> >> So what do you think about exporting a really dumb
>> >> bdi->dirty_background_bytes, which will effectively give smart users
>> >> the freedom to do smart control over per-bdi background writeback
>> >> threshold? The users are offered the freedom to do his own bandwidth
>> >> estimation and choose not to rely on the kernel estimation, which will
>> >> free us from the burden of maintaining a flaky interface as well. :)
>> >   That's fine with me. Just it would be nice if we gave
>> > bdi->dirty_background_bytes some useful initial value. Maybe like
>> > dirtiable_memory * dirty_background_ratio?
>> Global dirty_background_bytes default value is zero that means
>> flushing is started based on dirty_background_ratio and dirtiable
>> memory.
>> Is it correct to set per bdi default dirty threshold
>> (bdi->dirty_background_bytes) equal to global dirty threshold  -
>> dirtiable_memory * dirty_background_ratio ?
>   Right, the default setting I proposed doesn't make a difference. And it's
> not obvious how to create one which is more meaningful. Pity.
>
>> In my opinion, default setting for per bdi-> dirty_background_bytes
>> should be zero to avoid any confusion and any change in default
>> writeback behaviour.
>   OK, fine with me.
Okay, I will rework the patches along those lines.
Thanks, Jan and Wu!
>
> 								Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-09-25  6:46     ` Namjae Jeon
  2012-09-25  6:54       ` Namjae Jeon
@ 2012-10-19  7:51       ` Namjae Jeon
  2012-10-22  1:25         ` Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread
From: Namjae Jeon @ 2012-10-19  7:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fengguang Wu, Jan Kara, linux-kernel, Namjae Jeon, Vivek Trivedi,
	Linux Memory Management List, linux-fsdevel

Hi Dave.

Test Procedure:

1) Local USB disk WRITE speed on NFS server is ~25 MB/s

2) Run WRITE test (create a 1 GB file) on NFS Client with default
writeback settings on NFS Server. By default
bdi->dirty_background_bytes = 0, which means no change in the default
writeback behaviour.

3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to
local USB disk write speed on NFS Server)
*** only on NFS Server - not on NFS Client ***

[NFS Server]
# echo $((25*1024*1024)) > /sys/block/sdb/bdi/dirty_background_bytes
# cat /sys/block/sdb/bdi/dirty_background_bytes
26214400

4) Run WRITE test again on NFS client to see change in WRITE speed at NFS client

Test setup details:
Test run on a PC: FC16, 3 GB RAM, 1000 Mbit/s Ethernet;
create a 1 GB file.

--------------------------------------------------------------------------------
Table 1: XFS over NFS - WRITE SPEED on NFS Client
--------------------------------------------------------------------------------
               default writeback    bdi->dirty_background_bytes
               setting              = 25 MB

RecSize        write speed (MB/s)   write speed (MB/s)   % Change
10485760       27.39                28.53                      4%
 1048576       27.90                28.59                      2%
  524288       27.55                28.94                      5%
  262144       25.40                28.58                     13%
  131072       25.73                27.55                      7%
   65536       25.85                28.45                     10%
   32768       26.13                28.64                     10%
   16384       26.17                27.93                      7%
    8192       25.64                28.07                      9%
    4096       26.28                28.19                      7%

--------------------------------------------------------------------------------
Table 2: EXT4 over NFS - WRITE SPEED on NFS Client
--------------------------------------------------------------------------------
               default writeback    bdi->dirty_background_bytes
               setting              = 25 MB

RecSize        write speed (MB/s)   write speed (MB/s)   % Change
10485760       23.87                28.30                     19%
 1048576       24.81                27.79                     12%
  524288       24.53                28.14                     15%
  262144       24.21                27.99                     16%
  131072       24.11                28.33                     18%
   65536       23.73                28.21                     19%
   32768       25.66                27.52                      7%
   16384       24.30                27.67                     14%
    8192       23.60                27.08                     15%
    4096       23.35                27.24                     17%

As shown in Tables 1 & 2 above, there is a performance improvement at
the NFS client over gigabit Ethernet for both EXT4 and XFS over NFS,
and we did not observe any degradation in write speed.
However, the performance gain varies between filesystems, i.e. it is
different for XFS and EXT4 over NFS.

We also tried this change with BTRFS over NFS, but we did not see any
significant change in WRITE speed.

-----------------------------------------------------------------------------------
Multiple NFS Client test:
-----------------------------------------------------------------------------------
Sorry - we could not arrange multiple PCs to verify this.
So we tried 1 NFS server + 2 NFS clients using 3 target boards:
ARM target + 512 MB RAM + Ethernet 100 Mbit/s, create a 1 GB file.

-----------------------------------------------------------------------------
Table 3: bdi->dirty_background_bytes = 0 MB
         - default writeback behaviour
-----------------------------------------------------------------------------
RecSize        Write Speed        Write Speed     Combined
                    on Client 1          on client 2     write speed
                        (MB/s)               (MB/s)          (MB/s)

10485760            5.45                  5.36              10.81
1048576              5.44                  5.34              10.78
524288                5.48                  5.51              10.99
262144                6.24                  4.83              11.07
131072                5.58                  5.53              11.11
65536                  5.51                  5.48              10.99
32768                  5.42                  5.46              10.88
16384                  5.62                  5.58              11.2
8192                    5.59                  5.49              11.08
4096                    5.57                  6.38              11.95

-----------------------------------------------------------------------
Table 4: bdi->dirty_background_bytes = 25 MB
-----------------------------------------------------------------------
RecSize        Write Speed        Write Speed     Combined
                     on Client 1        on client 2     write speed
                         (MB/s)            (MB/s)          (MB/s)

10485760              5.43              5.76             11.19
1048576                5.51              5.72             11.23
524288                  5.37              5.69             11.06
262144                  5.46              5.51             10.97
131072                  5.64              5.6               11.24
65536                    5.53              5.64             11.17
32768                    5.51              5.53             11.04
16384                    5.51              5.51             11.02
8192                      5.61              5.59             11.2
4096                      6.11              5.65             11.76

As shown in Tables 3 & 4 above, there is no significant drop in the
WRITE speed of the individual NFS clients, and there is a minor
improvement in the combined write speed of NFS clients 1 & 2.


> 1. what's the comparison in performance to typical NFS
> server writeback parameter tuning? i.e. dirty_background_ratio=5,
> dirty_ratio=10, dirty_expire_centiseconds=1000,
> dirty_writeback_centisecs=1? i.e. does this give change give any
> benefit over the current common practice for configuring NFS
> servers?

Agreed, the above improvement in write speed can also be achieved by
tuning the above write-back parameters.
But if we change those settings, it changes write-back behaviour
system-wide.
On the other hand, if we change the proposed per-bdi setting,
bdi->dirty_background_bytes, it changes write-back behaviour only for
the block device exported by the NFS server.
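
For illustration (a sketch only - sdb and the numbers are just examples,
and the per-bdi knob is the one proposed in this patch set), the
difference in scope is:

[NFS Server] Global tuning - changes writeback for every device:
# echo 5  > /proc/sys/vm/dirty_background_ratio
# echo 10 > /proc/sys/vm/dirty_ratio

[NFS Server] Per-bdi tuning - changes writeback only for the exported disk:
# echo $((25*1024*1024)) > /sys/block/sdb/bdi/dirty_background_bytes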

> 2. what happens when you have 10 clients all writing to the server
> at once? Or a 100? NFS servers rarely have a single writer to a
> single file at a time, so what impact does this change have on
> multiple concurrent file write performance from multiple clients

Sorry, we could not arrange more than 2 PCs to verify this.
So we tried it on 3 ARM target boards - 1 board as the NFS server and
2 boards as NFS clients, connected through a hub at 100 Mbit/s
Ethernet link speed.
Please refer to the Table 3 & 4 results above.

> 3. Following on from the multiple client test, what difference does it
> make to file fragmentation rates? Writing more frequently means
> smaller allocations and writes, and that tends to lead to higher
> fragmentation rates, especially when multiple files are being
> written concurrently. Higher fragmentation also means lower
> performance over time as fragmentation accelerates filesystem aging
> effects on performance.  IOWs, it may be faster when new, but it
> will be slower 3 months down the track and that's a bad tradeoff to
> make.

We agree that there could be a bit more fragmentation. But as you know,
we are not changing the writeback settings at the NFS clients.
So write-back behaviour on the NFS client will not change - IO requests
will still be buffered at the NFS client as per the existing write-back
behaviour.

Also, by default we set bdi->dirty_background_bytes = 0, so it does
not change the default writeback setting unless the user wants to tune
it for their environment.


> 4. What happens for higher bandwidth network links? e.g. gigE or
> 10gigE? Are the improvements still there? Or does it cause
> regressions at higher speeds? I'm especially interested in what
> happens to multiple writers at higher network speeds, because that's
> a key performance metric used to measure enterprise level NFS
> servers.

As shown in Tables 1 & 2 above, the patch also provides a WRITE speed
improvement on a gigabit Ethernet interface, and it does not degrade
write speed.

We could not arrange 10GigE, so we could verify this patch only on
100 Mbit/s and 1000 Mbit/s Ethernet links.

> 5. Are the improvements consistent across different filesystem
> types?  We've had writeback changes in the past cause improvements
> on one filesystem but significant regressions on others.  I'd
> suggest that you need to present results for ext4, XFS and btrfs so
> that we have a decent idea of what we can expect from the change to
> the generic code.

As shown in Tables 1 & 2 above, the WRITE speed gain differs between
filesystems, i.e. it is different at the NFS client for XFS and EXT4.
We also tried BTRFS over NFS, but we could not see any WRITE speed
gain or degradation on BTRFS over NFS, so we are not posting the
BTRFS results here.

Please let us know your opinion.

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-10-19  7:51       ` Namjae Jeon
@ 2012-10-22  1:25         ` Dave Chinner
  2012-11-19 23:18           ` Namjae Jeon
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2012-10-22  1:25 UTC (permalink / raw)
  To: Namjae Jeon
  Cc: Fengguang Wu, Jan Kara, linux-kernel, Namjae Jeon, Vivek Trivedi,
	Linux Memory Management List, linux-fsdevel

On Fri, Oct 19, 2012 at 04:51:05PM +0900, Namjae Jeon wrote:
> Hi Dave.
> 
> Test Procedure:
> 
> 1) Local USB disk WRITE speed on NFS server is ~25 MB/s
> 
> 2) Run WRITE test(create 1 GB file) on NFS Client with default
> writeback settings on NFS Server. By default
> bdi->dirty_background_bytes = 0, that means no change in default
> writeback behaviour
> 
> 3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to
> local USB disk write speed on NFS Server)
> *** only on NFS Server - not on NFS Client ***

Ok, so the results look good, but it's not really addressing what I
was asking.  A typical desktop PC has a disk that can do
100MB/s and GbE, so I was expecting a test that showed throughput
close to GbE maximums at least (ie. around that 100MB/s). I have 3
year old, low end, low power hardware (atom) that handles twice the
throughput you are testing here, and most current consumer NAS
devices are more powerful than this. IOWs, I think the rates you are
testing at are probably too low even for the consumer NAS market to
consider relevant...

> ----------------------------------------------------------------------------------
> Multiple NFS Client test:
> -----------------------------------------------------------------------------------
> Sorry - We could not arrange multiple PCs to verify this.
> So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards:
> ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File

But this really doesn't tell us anything - it's still only 100Mb/s,
which we'd expect is already getting very close to line rate even
with low powered client hardware.

What I'm concerned about is the NFS server "sweet spot" - a $10k server
that exports 20TB of storage and can sustain close to a GB/s of NFS
traffic over a single 10GbE link with tens to hundreds of clients.
100MB/s and 10 clients is about the minimum needed to be able to
extrapolate a little and make an informed guess of how it will scale
up....

> > 1. what's the comparison in performance to typical NFS
> > server writeback parameter tuning? i.e. dirty_background_ratio=5,
> > dirty_ratio=10, dirty_expire_centiseconds=1000,
> > dirty_writeback_centisecs=1? i.e. does this give change give any
> > benefit over the current common practice for configuring NFS
> > servers?
> 
> Agreed, that above improvement in write speed can be achieved by
> tuning above write-back parameters.
> But if we change these settings, it will change write-back behavior
> system wide.
> On the other hand, if we change proposed per bdi setting,
> bdi->dirty_background_bytes it will change write-back behavior for the
> block device exported on NFS server.

I already know what the difference between global vs per-bdi tuning
means.  What I want to know is how your results compare
*numerically* to just having a tweaked global setting on a vanilla
kernel.  i.e. is there really any performance benefit to per-bdi
configuration that cannot be gained by existing methods?

> > 2. what happens when you have 10 clients all writing to the server
> > at once? Or a 100? NFS servers rarely have a single writer to a
> > single file at a time, so what impact does this change have on
> > multiple concurrent file write performance from multiple clients
> 
> Sorry, we could not arrange more than 2 PCs for verifying this.

Really? Well, perhaps there's some tools that might be useful for
you here:

http://oss.sgi.com/projects/nfs/testtools/

"Weber

Test load generator for NFS. Uses multiple threads, multiple
sockets and multiple IP addresses to simulate loads from many
machines, thus enabling testing of NFS server setups with larger
client counts than can be tested with physical infrastructure (or
Virtual Machine clients). Has been useful in automated NFS testing
and as a pinpoint NFS load generator tool for performance
development."

> > 3. Following on from the multiple client test, what difference does it
> > make to file fragmentation rates? Writing more frequently means
> > smaller allocations and writes, and that tends to lead to higher
> > fragmentation rates, especially when multiple files are being
> > written concurrently. Higher fragmentation also means lower
> > performance over time as fragmentation accelerates filesystem aging
> > effects on performance.  IOWs, it may be faster when new, but it
> > will be slower 3 months down the track and that's a bad tradeoff to
> > make.
> 
> We agree that there could be bit more framentation. But as you know,
> we are not changing writeback settings at NFS clients.
> So, write-back behavior on NFS client will not change - IO requests
> will be buffered at NFS client as per existing write-back behavior.

I think you misunderstand - writeback settings on the server greatly
impact the way the server writes data and therefore the way files
are fragmented. It has nothing to do with client side tuning.

Effectively, what you are presenting is best case numbers - empty
filesystem, single client, streaming write, no fragmentation, no
allocation contention, no competing IO load causing write
latency.  Testing with lots of clients introduces all of
these things, and that will greatly impact server behaviour.
Aggregation in memory isolates a lot of this variation from
writeback and hence smooths out a lot of the variability that leads
to fragmentation, seeks, latency spikes and premature filesystem
aging.

That is, if you set a 100MB dirty_bytes limit on a bdi it will give
really good buffering for a single client doing a streaming write.
If you've got 10 clients, then assuming fair distribution of server
resources, that is 10MB per client per writeback trigger.
That's line ball as to whether it will cause fragmentation severe
enough to impact server throughput. If you've got 100 clients, then
that's only 1MB per client per writeback trigger, and that's
definitely too low to maintain decent writeback behaviour.  i.e.
you're now writing 100 files 1MB at a time, and that tends towards
random IO patterns rather than sequential IO patterns. Seek time
determines throughput, not IO bandwidth limits.

IOWs, as the client count goes up, the writeback patterns will tend
more towards random IO than sequential IO unless the amount of
buffering allowed before writeback triggers also grows. That's
important, because random IO is much slower than sequential IO.
What I'd like to have is some insight into whether this patch
changes that inflection point, for better or for worse. The only way
to find that is to run multi-client testing....

> > 5. Are the improvements consistent across different filesystem
> > types?  We've had writeback changes in the past cause improvements
> > on one filesystem but significant regressions on others.  I'd
> > suggest that you need to present results for ext4, XFS and btrfs so
> > that we have a decent idea of what we can expect from the change to
> > the generic code.
> 
> As mentioned in the above Table 1 & 2, performance gain in WRITE speed
> is different on different file systems i.e. different on NFS client
> over XFS & EXT4.
> We also tried BTRFS over NFS, but we could not see any WRITE speed
> performance gain/degrade on BTRFS over NFS, so we are not posting
> BTRFS results here.

You should post btrfs numbers even if they show no change. It wasn't
until I got this far that I realised you'd even tested
BTRFS. I don't know what to make of this, because I don't know what
the throughput rates are compared to XFS and EXT4....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-10-22  1:25         ` Dave Chinner
@ 2012-11-19 23:18           ` Namjae Jeon
       [not found]             ` <50bec4af.0aad2a0a.2fc9.6fe5SMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Namjae Jeon @ 2012-11-19 23:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fengguang Wu, Jan Kara, linux-kernel, Namjae Jeon, Vivek Trivedi,
	Linux Memory Management List, linux-fsdevel

2012/10/22, Dave Chinner <david@fromorbit.com>:
> On Fri, Oct 19, 2012 at 04:51:05PM +0900, Namjae Jeon wrote:
>> Hi Dave.
>>
>> Test Procedure:
>>
>> 1) Local USB disk WRITE speed on NFS server is ~25 MB/s
>>
>> 2) Run WRITE test(create 1 GB file) on NFS Client with default
>> writeback settings on NFS Server. By default
>> bdi->dirty_background_bytes = 0, that means no change in default
>> writeback behaviour
>>
>> 3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to
>> local USB disk write speed on NFS Server)
>> *** only on NFS Server - not on NFS Client ***
>
> Ok, so the results look good, but they're not really addressing what
> I was asking.  A typical desktop PC has a disk that can do 100MB/s
> and GbE, so I was expecting a test that showed throughput close to
> GbE maximums at least (i.e. around that 100MB/s). I have 3 year old,
> low end, low power hardware (atom) that handles twice the throughput
> you are testing here, and most current consumer NAS devices are more
> powerful than this. IOWs, I think the rates you are testing at are
> probably too low even for the consumer NAS market to consider
> relevant...
>
>> ----------------------------------------------------------------------------------
>> Multiple NFS Client test:
>> -----------------------------------------------------------------------------------
>> Sorry - We could not arrange multiple PCs to verify this.
>> So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards:
>> ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File
>
> But this really doesn't tell us anything - it's still only 100Mb/s,
> which we'd expect is already getting very close to line rate even
> with low powered client hardware.
>
> What I'm concerned about is the NFS server "sweet spot" - a $10k
> server that exports 20TB of storage and can sustain close to a GB/s
> of NFS traffic over a single 10GbE link with tens to hundreds of
> clients.  100MB/s and 10 clients is about the minimum needed to be
> able to extrapolate a little and make an informed guess of how it
> will scale up....
>
>> > 1. what's the comparison in performance to typical NFS
>> > server writeback parameter tuning? i.e. dirty_background_ratio=5,
>> > dirty_ratio=10, dirty_expire_centisecs=1000,
>> > dirty_writeback_centisecs=1? i.e. does this change give any
>> > benefit over the current common practice for configuring NFS
>> > servers?
>>
>> Agreed, the above improvement in write speed can be achieved by
>> tuning the above write-back parameters.
>> But if we change those settings, it will change write-back behavior
>> system wide.
>> On the other hand, if we change the proposed per-bdi setting,
>> bdi->dirty_background_bytes, it will change write-back behavior only
>> for the block device exported on the NFS server.
>
> I already know what the difference between global and per-bdi tuning
> means.  What I want to know is how your results compare
> *numerically* to just having a tweaked global setting on a vanilla
> kernel, i.e. is there really any performance benefit to per-bdi
> configuration that cannot be gained by existing methods?
>
>> > 2. what happens when you have 10 clients all writing to the server
>> > at once? Or a 100? NFS servers rarely have a single writer to a
>> > single file at a time, so what impact does this change have on
>> > multiple concurrent file write performance from multiple clients
>>
>> Sorry, we could not arrange more than 2 PCs for verifying this.
>
> Really? Well, perhaps there are some tools that might be useful to
> you here:
>
> http://oss.sgi.com/projects/nfs/testtools/
>
> "Weber
>
> Test load generator for NFS. Uses multiple threads, multiple
> sockets and multiple IP addresses to simulate loads from many
> machines, thus enabling testing of NFS server setups with larger
> client counts than can be tested with physical infrastructure (or
> Virtual Machine clients). Has been useful in automated NFS testing
> and as a pinpoint NFS load generator tool for performance
> development."
>

Hi Dave,
We ran "weber" test on below setup:
1) SATA HDD - Local WRITE speed ~120 MB/s, NFS WRITE speed ~90 MB/s
2) Used 10GbE - network interface to mount NFS

We ran "weber" test with  NFS clients ranging from 1 to 100,
below is the % GAIN in NFS WRITE speed with
bdi->dirty_background_bytes = 100 MB at NFS server

-------------------------------------------------
| Number of NFS Clients |% GAIN in WRITE Speed  |
|-----------------------------------------------|
|         1             |     19.83 %           |
|-----------------------------------------------|
|         2             |      2.97 %           |
|-----------------------------------------------|
|         3             |      2.01 %           |
|-----------------------------------------------|
|        10             |      0.25 %           |
|-----------------------------------------------|
|        20             |      0.23 %           |
|-----------------------------------------------|
|        30             |      0.13 %           |
|-----------------------------------------------|
|       100             |    - 0.60 %           |
-------------------------------------------------

With the bdi->dirty_background_bytes setting at the NFS server, we
observed that the NFS WRITE speed improvement is largest with a single
NFS client, and the improvement drops as the number of NFS clients
increases from 1 to 100.

So, the bdi->dirty_background_bytes setting might be useful where there
is only one NFS client (a scenario like ours), but it is not useful for
big NFS servers which host hundreds of NFS clients.

Let me know your opinion.

Thanks.
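
For reference, the trigger exercised in this test is, in rough outline,
a per-bdi variant of the existing background threshold check. A minimal
sketch (not the actual patch code, untested; it assumes the
dirty_background_bytes field added by this patch set and the existing
bdi_stat()/BDI_RECLAIMABLE per-bdi counters):

/*
 * Sketch only: start background writeback for this bdi once its
 * reclaimable pages exceed the per-bdi byte limit, independent of
 * the global background threshold.
 */
static bool bdi_over_dirty_background_bytes(struct backing_dev_info *bdi)
{
	unsigned long limit = bdi->dirty_background_bytes;

	if (!limit)		/* 0 == disabled, keep default behaviour */
		return false;

	/* bdi_stat() counts pages, the knob is in bytes */
	return bdi_stat(bdi, BDI_RECLAIMABLE) > (limit >> PAGE_SHIFT);
}

over_bground_thresh() (or its caller) would OR this in with the existing
global checks, so with the knob set to 100 MB the flusher starts roughly
as soon as ~100 MB of dirty pages have accumulated against the exported
disk.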


* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
       [not found]             ` <50bec4af.0aad2a0a.2fc9.6fe5SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2012-12-05 10:19               ` Namjae Jeon
  0 siblings, 0 replies; 16+ messages in thread
From: Namjae Jeon @ 2012-12-05 10:19 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Fengguang Wu, Jan Kara, Dave Chinner, linux-kernel, Namjae Jeon,
	Vivek Trivedi, Linux Memory Management List, linux-fsdevel

2012/12/5, Wanpeng Li <liwanp@linux.vnet.ibm.com>:
> Hi Namjae,
>
> How about setting bdi->dirty_background_bytes according to bdi_thresh?
> I found an issue in the background flush path while reviewing the code:
> if we are over the background flush threshold, wb_check_background_flush
> will kick a work item to the current per-bdi flusher, but it may be
> heavy dirtiers writing to other bdis that pushed us over the threshold
> rather than the current bdi.  The worst case is that the current bdi
> holds a lot of frequently used data and the flush leads to cache
> thrashing.  How about adding a check in wb_check_background_flush: if
> it is not the current bdi that contributes a large number of dirty
> pages to the background flush threshold (i.e. it is not over
> bdi->dirty_background_bytes), then don't bother it?

Hi Wanpeng.

First, thanks for your suggestion!
Yes, I think it looks reasonable.
I will start checking it.
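
For what it's worth, the check described above could look roughly like
the following (a sketch against wb_check_background_flush() of this
kernel generation, untested; dirty_background_bytes is the per-bdi
field from this patch set):

/*
 * Sketch of the check Wanpeng suggests (untested): only kick
 * background writeback for this bdi if the bdi itself is over its
 * per-bdi byte limit, not merely because other bdis pushed the
 * global dirty count over the background threshold.
 */
static long wb_check_background_flush(struct bdi_writeback *wb)
{
	struct backing_dev_info *bdi = wb->bdi;

	if (over_bground_thresh(bdi) &&
	    (!bdi->dirty_background_bytes ||	/* knob unset: old behaviour */
	     bdi_stat(bdi, BDI_RECLAIMABLE) >
			(bdi->dirty_background_bytes >> PAGE_SHIFT))) {
		struct wb_writeback_work work = {
			.nr_pages	= LONG_MAX,
			.sync_mode	= WB_SYNC_NONE,
			.for_background	= 1,
			.range_cyclic	= 1,
			.reason		= WB_REASON_BACKGROUND,
		};

		return wb_writeback(wb, &work);
	}

	return 0;
}

That way over_bground_thresh() alone would no longer be enough to kick
this bdi's flusher.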

Thanks.


end of thread, other threads:[~2012-12-05 10:19 UTC | newest]

Thread overview: 16+ messages
2012-09-16 12:25 [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable Namjae Jeon
2012-09-20  8:44 ` Fengguang Wu
2012-09-24 22:23   ` Jan Kara
2012-09-25  6:50     ` Namjae Jeon
2012-09-26 16:56     ` Fengguang Wu
2012-09-26 20:22       ` Jan Kara
2012-09-27  6:00         ` Namjae Jeon
2012-09-27 11:52           ` Jan Kara
2012-09-28  1:59             ` Namjae Jeon
2012-09-25  1:36   ` Dave Chinner
2012-09-25  6:46     ` Namjae Jeon
2012-09-25  6:54       ` Namjae Jeon
2012-10-19  7:51       ` Namjae Jeon
2012-10-22  1:25         ` Dave Chinner
2012-11-19 23:18           ` Namjae Jeon
     [not found]             ` <50bec4af.0aad2a0a.2fc9.6fe5SMTPIN_ADDED_BROKEN@mx.google.com>
2012-12-05 10:19               ` Namjae Jeon
