linux-kernel.vger.kernel.org archive mirror
* [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable
@ 2012-08-18  9:50 Namjae Jeon
  2012-08-19  2:57 ` [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance) Fengguang Wu
  0 siblings, 1 reply; 11+ messages in thread
From: Namjae Jeon @ 2012-08-18  9:50 UTC (permalink / raw)
  To: fengguang.wu, akpm; +Cc: linux-kernel, Namjae Jeon, Namjae Jeon

From: Namjae Jeon <namjae.jeon@samsung.com>

This patch is based on suggestion by Wu Fengguang:
https://lkml.org/lkml/2011/8/19/19

The kernel has a mechanism to do writeback as per the dirty_ratio and
dirty_background_ratio settings. It also maintains a per-task dirty rate
limit to keep dirty pages balanced at any given instant by doing bdi
bandwidth estimation.

The kernel also has max_ratio/min_ratio tunables to specify the percentage
of write cache used to control per-bdi dirty limits and task throttling.

However, there may be use cases where the user wants a writeback tuning
parameter to flush dirty data at a desired/tuned time interval.

dirty_background_time provides such an interface: the user can tune the
background writeback start time via /sys/block/sda/bdi/dirty_background_time.

dirty_background_time is used along with the average bdi write bandwidth
estimate to decide when to start background writeback.
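
With this patch, background writeback for a bdi kicks in once roughly
dirty_background_time milliseconds' worth of dirty data (at the estimated
bdi write bandwidth) has accumulated. A worked example with illustrative
numbers (the patch defaults the value to 10000ms; 1000ms below is only an
example, not the value used in the tests further down):

    estimated bdi write bandwidth : 25MB/s
    dirty_background_time         : 1000ms
    background threshold          : 25MB/s * 1000/1000 ~ 25MB of reclaimable pages

    echo 1000 > /sys/block/sda/bdi/dirty_background_time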

One use case that demonstrates the patch functionality is an NFS setup:
an NFS server and client, both ARM target boards, connected by a 100Mbps
Ethernet link, with a USB disk attached to the server that has a local
write speed of 25MB/s.

Now, if we perform a write operation over NFS (client to server), data
can travel at a maximum of 100Mbps as per the network speed. But the
default write speed to the USB HDD over NFS comes out to around 8MB/s,
far below the speed of the network.

The reason is that, as per the NFS logic, during a write operation pages
are first dirtied on the NFS client side; then, after reaching the dirty
threshold/writeback limit (or in case of sync), the data is actually sent
to the NFS server (so pages are dirtied again on the server side) and
committed with a COMMIT call from client to server. I.e. if 100MB of data
is dirtied and sent, transferring it over the 100Mbps link (~12.5MB/s)
takes a minimum of ~8-9 seconds.

After the data is received, it takes approximately 100MB/25MB/s ~ 4 seconds
to write the data to the USB HDD on the server side, making the overall
time to write this much data ~12 seconds, which in practice comes out to
around 7-8MB/s. Only after this is a COMMIT response sent to the NFS
client.

However, we can improve this write performance by making use of the NFS
server's idle time, i.e. while data is being received from the client,
simultaneously initiate writeback on the server side. So instead of
waiting for the complete data to arrive and only then starting writeback,
we work in parallel while the network is still busy receiving the data.
In this way overall performance is improved.
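
A rough back-of-the-envelope check of these numbers (assuming ~12MB/s
effective throughput on the 100Mbps link and 25MB/s to the local USB disk,
as above):

    serialized : 100MB/12MB/s + 100MB/25MB/s ~ 8.5s + 4s ~ 12.5s -> ~8MB/s
    overlapped : max(100MB/12MB/s, 100MB/25MB/s) ~ 8.5s          -> ~11-12MB/s at best

which is consistent with the measured results below.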

If we tune dirty_background_time, we can see an increase in performance,
and it comes out to be ~11MB/s.
Results are:
==========================================================
Case:1 - Normal setup without any changes
./performancetest_arm ./100MB write

 RecSize  WriteSpeed   RanWriteSpeed

 10485760  7.93MB/sec   8.11MB/sec
  1048576  8.21MB/sec   7.80MB/sec
   524288  8.71MB/sec   8.39MB/sec
   262144  8.91MB/sec   7.83MB/sec
   131072  8.91MB/sec   8.95MB/sec
    65536  8.95MB/sec   8.90MB/sec
    32768  8.76MB/sec   8.93MB/sec
    16384  8.78MB/sec   8.67MB/sec
     8192  8.90MB/sec   8.52MB/sec
     4096  8.89MB/sec   8.28MB/sec

The average speed is near 8MB/s.

Case:2 - Modified the dirty_background_time
./performancetest_arm ./100MB write

 RecSize  WriteSpeed   RanWriteSpeed

 10485760  10.56MB/sec  10.37MB/sec
  1048576  10.43MB/sec  10.33MB/sec
   524288  10.32MB/sec  10.02MB/sec
   262144  10.52MB/sec  10.19MB/sec
   131072  10.34MB/sec  10.07MB/sec
    65536  10.31MB/sec  10.06MB/sec
    32768  10.27MB/sec  10.24MB/sec
    16384  10.54MB/sec  10.03MB/sec
     8192  10.41MB/sec  10.38MB/sec
     4096  10.34MB/sec  10.12MB/sec

We can see that the average write speed has increased to ~10-11MB/s.
============================================================

Now, to make this work we would otherwise need to change
dirty_[writeback|expire]_interval so that the flusher threads are woken up
earlier. But modifying these values impacts overall system performance,
while our requirement is to modify these parameters only for the device
used by the NFS export.

This patch provides the change per block device, so that we can modify the
intervals per device; the overall system is not impacted by the change and
we still get the improved write speed.
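
For reference (these are the long-standing system-wide sysctls, not
something added by this patch), the global equivalents are:

    /proc/sys/vm/dirty_writeback_centisecs   # flusher wakeup interval
    /proc/sys/vm/dirty_expire_centisecs      # dirty data age limit

and because they affect every bdi at once, they are too blunt for tuning a
single NFS-exported device.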

The above is just one use case for this patch.

Original-patch-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Tested-by: Vivek Trivedi <t.vivek@samsung.com>
---
 fs/fs-writeback.c           |   18 ++++++++++++++++--
 include/linux/backing-dev.h |    1 +
 include/linux/writeback.h   |    1 +
 mm/backing-dev.c            |   22 ++++++++++++++++++++++
 mm/page-writeback.c         |    3 ++-
 5 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index be3efc4..75fda1d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -769,6 +769,19 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
 	return false;
 }
 
+bool over_dirty_bground_time(struct backing_dev_info *bdi)
+{
+	unsigned long background_thresh;
+
+	background_thresh = bdi->avg_write_bandwidth *
+		bdi->dirty_background_time / 1000;
+
+	if (bdi_stat(bdi, BDI_RECLAIMABLE) > background_thresh)
+		return true;
+
+	return false;
+}
+
 /*
  * Called under wb->list_lock. If there are multiple wb per bdi,
  * only the flusher working on the first wb should do it.
@@ -828,7 +841,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * For background writeout, stop when we are below the
 		 * background dirty threshold
 		 */
-		if (work->for_background && !over_bground_thresh(wb->bdi))
+		if (work->for_background && !over_bground_thresh(wb->bdi) &&
+			!over_dirty_bground_time(wb->bdi))
 			break;
 
 		/*
@@ -920,7 +934,7 @@ static unsigned long get_nr_dirty_pages(void)
 
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
-	if (over_bground_thresh(wb->bdi)) {
+	if (over_bground_thresh(wb->bdi) || over_dirty_bground_time(wb->bdi)) {
 
 		struct wb_writeback_work work = {
 			.nr_pages	= LONG_MAX,
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2a9a9ab..ad83783 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -95,6 +95,7 @@ struct backing_dev_info {
 
 	unsigned int min_ratio;
 	unsigned int max_ratio, max_prop_frac;
+	unsigned int dirty_background_time;
 
 	struct bdi_writeback wb;  /* default writeback info for this bdi */
 	spinlock_t wb_lock;	  /* protects work_list */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index b82a83a..433cd09 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -96,6 +96,7 @@ long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
 void inode_wait_for_writeback(struct inode *inode);
+bool over_dirty_bground_time(struct backing_dev_info *bdi);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b41823c..0f9f798 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -219,12 +219,33 @@ static ssize_t max_ratio_store(struct device *dev,
 }
 BDI_SHOW(max_ratio, bdi->max_ratio)
 
+static ssize_t dirty_background_time_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	char *end;
+	unsigned int msec;
+	ssize_t ret = -EINVAL;
+
+	msec = simple_strtoul(buf, &end, 10);
+	if (*buf && (end[0] == '\0' || (end[0] == '\n' && end[1] == '\0'))) {
+		bdi->dirty_background_time = msec;
+		ret = count;
+
+		if (over_dirty_bground_time(bdi))
+			bdi_start_background_writeback(bdi);
+	}
+	return ret;
+}
+BDI_SHOW(dirty_background_time, bdi->dirty_background_time)
+
 #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
 
 static struct device_attribute bdi_dev_attrs[] = {
 	__ATTR_RW(read_ahead_kb),
 	__ATTR_RW(min_ratio),
 	__ATTR_RW(max_ratio),
+	__ATTR_RW(dirty_background_time),
 	__ATTR_NULL,
 };
 
@@ -626,6 +647,7 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
+	bdi->dirty_background_time = 10000;
 	spin_lock_init(&bdi->wb_lock);
 	INIT_LIST_HEAD(&bdi->bdi_list);
 	INIT_LIST_HEAD(&bdi->work_list);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 73a7a06..f51a252 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1403,7 +1403,8 @@ pause:
 	if (laptop_mode)
 		return;
 
-	if (nr_reclaimable > background_thresh)
+	if (nr_reclaimable > background_thresh ||
+		over_dirty_bground_time(bdi))
 		bdi_start_background_writeback(bdi);
 }
 
-- 
1.7.9.5



* Re: [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance)
  2012-08-18  9:50 [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable Namjae Jeon
@ 2012-08-19  2:57 ` Fengguang Wu
  2012-08-20  0:48   ` Namjae Jeon
  2012-08-20  2:00   ` Dave Chinner
  0 siblings, 2 replies; 11+ messages in thread
From: Fengguang Wu @ 2012-08-19  2:57 UTC (permalink / raw)
  To: Namjae Jeon; +Cc: akpm, linux-kernel, Namjae Jeon, linux-fsdevel, linux-nfs

On Sat, Aug 18, 2012 at 05:50:02AM -0400, Namjae Jeon wrote:
> From: Namjae Jeon <namjae.jeon@samsung.com>
> 
> This patch is based on suggestion by Wu Fengguang:
> https://lkml.org/lkml/2011/8/19/19
> 
> kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> ratio. It also maintains per task dirty rate limit to keep balance of
> dirty pages at any given instance by doing bdi bandwidth estimation.
> 
> Kernel also has max_ratio/min_ratio tunables to specify percentage of writecache
> to control per bdi dirty limits and task throtelling.
> 
> However, there might be a usecase where user wants a writeback tuning
> parameter to flush dirty data at desired/tuned time interval.
> 
> dirty_background_time provides an interface where user can tune background
> writeback start time using /sys/block/sda/bdi/dirty_background_time
> 
> dirty_background_time is used alongwith average bdi write bandwidth estimation
> to start background writeback.

Here lies my major concern about dirty_background_time: the write
bandwidth estimation is an _estimation_ and will surely become wildly
wrong in some cases. So the dirty_background_time implementation based
on it will not always work to users' expectations.

One important case: some users (e.g. Dave Chinner) explicitly take
advantage of the existing behavior to quickly create & delete a big
1GB temp file without worrying about triggering unnecessary IOs.

> One of the use case to demonstrate the patch functionality can be
> on NFS setup:-
> We have a NFS setup with ethernet line of 100Mbps, while the USB
> disk is attached to server, which has a local speed of 25MBps. Server
> and client both are arm target boards.
> 
> Now if we perform a write operation over NFS (client to server), as
> per the network speed, data can travel at max speed of 100Mbps. But
> if we check the default write speed of USB hdd over NFS it comes
> around to 8MB/sec, far below the speed of network.
> 
> Reason being is as per the NFS logic, during write operation, initially
> pages are dirtied on NFS client side, then after reaching the dirty
> threshold/writeback limit (or in case of sync) data is actually sent
> to NFS server (so now again pages are dirtied on server side). This
> will be done in COMMIT call from client to server i.e if 100MB of data
> is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9 seconds.
> 
> After the data is received, now it will take approx 100/25 ~4 Seconds to
> write the data to USB Hdd on server side. Hence making the overall time
> to write this much of data ~12 seconds, which in practically comes out to
> be near 7 to 8MB/second. After this a COMMIT response will be sent to NFS
> client.
> 
> However we may improve this write performace by making the use of NFS
> server idle time i.e while data is being received from the client,
> simultaneously initiate the writeback thread on server side. So instead
> of waiting for the complete data to come and then start the writeback,
> we can work in parallel while the network is still busy in receiving the
> data. Hence in this way overall performace will be improved.
> 
> If we tune dirty_background_time, we can see there
> is increase in the performace and it comes out to be ~ 11MB/seconds.
> Results are:-
> ==========================================================
> Case:1 - Normal setup without any changes
> ./performancetest_arm ./100MB write
> 
>  RecSize  WriteSpeed   RanWriteSpeed
> 
>  10485760  7.93MB/sec   8.11MB/sec
>   1048576  8.21MB/sec   7.80MB/sec
>    524288  8.71MB/sec   8.39MB/sec
>    262144  8.91MB/sec   7.83MB/sec
>    131072  8.91MB/sec   8.95MB/sec
>     65536  8.95MB/sec   8.90MB/sec
>     32768  8.76MB/sec   8.93MB/sec
>     16384  8.78MB/sec   8.67MB/sec
>      8192  8.90MB/sec   8.52MB/sec
>      4096  8.89MB/sec   8.28MB/sec
> 
> Average speed is near 8MB/seconds.
> 
> Case:2 - Modified the dirty_background_time
> ./performancetest_arm ./100MB write
> 
>  RecSize  WriteSpeed   RanWriteSpeed
> 
>  10485760  10.56MB/sec  10.37MB/sec
>   1048576  10.43MB/sec  10.33MB/sec
>    524288  10.32MB/sec  10.02MB/sec
>    262144  10.52MB/sec  10.19MB/sec
>    131072  10.34MB/sec  10.07MB/sec
>     65536  10.31MB/sec  10.06MB/sec
>     32768  10.27MB/sec  10.24MB/sec
>     16384  10.54MB/sec  10.03MB/sec
>      8192  10.41MB/sec  10.38MB/sec
>      4096  10.34MB/sec  10.12MB/sec
> 
> we can see, average write speed is increased to ~10-11MB/sec.
> ============================================================

The numbers are impressive! FYI, I tried another NFS-specific approach
to avoid big NFS COMMITs, which achieved similar performance gains:

nfs: writeback pages wait queue
https://lkml.org/lkml/2011/10/20/235

Thanks,
Fengguang

> Now to make this working we need to make change in dirty_[wirteback|expire]
> _interval so that flusher threads will be awaken up more early. But if we
> modify these values it will impact the overall system performace, while our
> requirement is to modify these parameters for the device used in NFS interface.
> 
> This patch provides the changes per block devices. So that we may modify the
> intervals as per the device and overall system is not impacted by the changes
> and we get improved
> 
> The above mentioned is one of the use case to use this patch.
> 
> Original-patch-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
> Tested-by: Vivek Trivedi <t.vivek@samsung.com>
> ---
>  fs/fs-writeback.c           |   18 ++++++++++++++++--
>  include/linux/backing-dev.h |    1 +
>  include/linux/writeback.h   |    1 +
>  mm/backing-dev.c            |   22 ++++++++++++++++++++++
>  mm/page-writeback.c         |    3 ++-
>  5 files changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index be3efc4..75fda1d 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -769,6 +769,19 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
>  	return false;
>  }
>  
> +bool over_dirty_bground_time(struct backing_dev_info *bdi)
> +{
> +	unsigned long background_thresh;
> +
> +	background_thresh = bdi->avg_write_bandwidth *
> +		bdi->dirty_background_time / 1000;
> +
> +	if (bdi_stat(bdi, BDI_RECLAIMABLE) > background_thresh)
> +		return true;
> +
> +	return false;
> +}
> +
>  /*
>   * Called under wb->list_lock. If there are multiple wb per bdi,
>   * only the flusher working on the first wb should do it.
> @@ -828,7 +841,8 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		 * For background writeout, stop when we are below the
>  		 * background dirty threshold
>  		 */
> -		if (work->for_background && !over_bground_thresh(wb->bdi))
> +		if (work->for_background && !over_bground_thresh(wb->bdi) &&
> +			!over_dirty_bground_time(wb->bdi))
>  			break;
>  
>  		/*
> @@ -920,7 +934,7 @@ static unsigned long get_nr_dirty_pages(void)
>  
>  static long wb_check_background_flush(struct bdi_writeback *wb)
>  {
> -	if (over_bground_thresh(wb->bdi)) {
> +	if (over_bground_thresh(wb->bdi) || over_dirty_bground_time(wb->bdi)) {
>  
>  		struct wb_writeback_work work = {
>  			.nr_pages	= LONG_MAX,
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 2a9a9ab..ad83783 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -95,6 +95,7 @@ struct backing_dev_info {
>  
>  	unsigned int min_ratio;
>  	unsigned int max_ratio, max_prop_frac;
> +	unsigned int dirty_background_time;
>  
>  	struct bdi_writeback wb;  /* default writeback info for this bdi */
>  	spinlock_t wb_lock;	  /* protects work_list */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index b82a83a..433cd09 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -96,6 +96,7 @@ long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
>  long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
>  void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
>  void inode_wait_for_writeback(struct inode *inode);
> +bool over_dirty_bground_time(struct backing_dev_info *bdi);
>  
>  /* writeback.h requires fs.h; it, too, is not included from here. */
>  static inline void wait_on_inode(struct inode *inode)
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index b41823c..0f9f798 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -219,12 +219,33 @@ static ssize_t max_ratio_store(struct device *dev,
>  }
>  BDI_SHOW(max_ratio, bdi->max_ratio)
>  
> +static ssize_t dirty_background_time_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t count)
> +{
> +	struct backing_dev_info *bdi = dev_get_drvdata(dev);
> +	char *end;
> +	unsigned int msec;
> +	ssize_t ret = -EINVAL;
> +
> +	msec = simple_strtoul(buf, &end, 10);
> +	if (*buf && (end[0] == '\0' || (end[0] == '\n' && end[1] == '\0'))) {
> +		bdi->dirty_background_time = msec;
> +		ret = count;
> +
> +		if (over_dirty_bground_time(bdi))
> +			bdi_start_background_writeback(bdi);
> +	}
> +	return ret;
> +}
> +BDI_SHOW(dirty_background_time, bdi->dirty_background_time)
> +
>  #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
>  
>  static struct device_attribute bdi_dev_attrs[] = {
>  	__ATTR_RW(read_ahead_kb),
>  	__ATTR_RW(min_ratio),
>  	__ATTR_RW(max_ratio),
> +	__ATTR_RW(dirty_background_time),
>  	__ATTR_NULL,
>  };
>  
> @@ -626,6 +647,7 @@ int bdi_init(struct backing_dev_info *bdi)
>  	bdi->min_ratio = 0;
>  	bdi->max_ratio = 100;
>  	bdi->max_prop_frac = FPROP_FRAC_BASE;
> +	bdi->dirty_background_time = 10000;
>  	spin_lock_init(&bdi->wb_lock);
>  	INIT_LIST_HEAD(&bdi->bdi_list);
>  	INIT_LIST_HEAD(&bdi->work_list);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 73a7a06..f51a252 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1403,7 +1403,8 @@ pause:
>  	if (laptop_mode)
>  		return;
>  
> -	if (nr_reclaimable > background_thresh)
> +	if (nr_reclaimable > background_thresh ||
> +		over_dirty_bground_time(bdi))
>  		bdi_start_background_writeback(bdi);
>  }
>  
> -- 
> 1.7.9.5


* Re: [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance)
  2012-08-19  2:57 ` [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance) Fengguang Wu
@ 2012-08-20  0:48   ` Namjae Jeon
  2012-08-20 14:50     ` Fengguang Wu
  2012-08-20  2:00   ` Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Namjae Jeon @ 2012-08-20  0:48 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: akpm, linux-kernel, Namjae Jeon, linux-fsdevel, linux-nfs

2012/8/19, Fengguang Wu <fengguang.wu@intel.com>:
> On Sat, Aug 18, 2012 at 05:50:02AM -0400, Namjae Jeon wrote:
>> From: Namjae Jeon <namjae.jeon@samsung.com>
>>
>> This patch is based on suggestion by Wu Fengguang:
>> https://lkml.org/lkml/2011/8/19/19
>>
>> kernel has mechanism to do writeback as per dirty_ratio and
>> dirty_background
>> ratio. It also maintains per task dirty rate limit to keep balance of
>> dirty pages at any given instance by doing bdi bandwidth estimation.
>>
>> Kernel also has max_ratio/min_ratio tunables to specify percentage of
>> writecache
>> to control per bdi dirty limits and task throtelling.
>>
>> However, there might be a usecase where user wants a writeback tuning
>> parameter to flush dirty data at desired/tuned time interval.
>>
>> dirty_background_time provides an interface where user can tune
>> background
>> writeback start time using /sys/block/sda/bdi/dirty_background_time
>>
>> dirty_background_time is used alongwith average bdi write bandwidth
>> estimation
>> to start background writeback.
>
> Here lies my major concern about dirty_background_time: the write
> bandwidth estimation is an _estimation_ and will sure become wildly
> wrong in some cases. So the dirty_background_time implementation based
> on it will not always work to the user expectations.
>
> One important case is, some users (eg. Dave Chinner) explicitly take
> advantage of the existing behavior to quickly create & delete a big
> 1GB temp file without worrying about triggering unnecessary IOs.
>
Hi Wu,
Okay, I have a question.

If we make dirty_writeback_interval per-bdi and tune it to a short
interval instead of adding background_time, we can get a similar
performance improvement:
/sys/block/<device>/bdi/dirty_writeback_interval
/sys/block/<device>/bdi/dirty_expire_interval

The NFS write performance improvement is just one use case.

If we can set the interval/time per bdi, other use cases will follow.

What do you think?

>The numbers are impressive! FYI, I tried another NFS specific approach
>to avoid big NFS COMMITs, which achieved similar performance gains:

>nfs: writeback pages wait queue
>https://lkml.org/lkml/2011/10/20/235

Thanks.


* Re: [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance)
  2012-08-19  2:57 ` [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance) Fengguang Wu
  2012-08-20  0:48   ` Namjae Jeon
@ 2012-08-20  2:00   ` Dave Chinner
  2012-08-20 18:01     ` J. Bruce Fields
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2012-08-20  2:00 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Namjae Jeon, akpm, linux-kernel, Namjae Jeon, linux-fsdevel, linux-nfs

On Sun, Aug 19, 2012 at 10:57:24AM +0800, Fengguang Wu wrote:
> On Sat, Aug 18, 2012 at 05:50:02AM -0400, Namjae Jeon wrote:
> > From: Namjae Jeon <namjae.jeon@samsung.com>
> > 
> > This patch is based on suggestion by Wu Fengguang:
> > https://lkml.org/lkml/2011/8/19/19
> > 
> > kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> > ratio. It also maintains per task dirty rate limit to keep balance of
> > dirty pages at any given instance by doing bdi bandwidth estimation.
> > 
> > Kernel also has max_ratio/min_ratio tunables to specify percentage of writecache
> > to control per bdi dirty limits and task throtelling.
> > 
> > However, there might be a usecase where user wants a writeback tuning
> > parameter to flush dirty data at desired/tuned time interval.
> > 
> > dirty_background_time provides an interface where user can tune background
> > writeback start time using /sys/block/sda/bdi/dirty_background_time
> > 
> > dirty_background_time is used alongwith average bdi write bandwidth estimation
> > to start background writeback.
> 
> Here lies my major concern about dirty_background_time: the write
> bandwidth estimation is an _estimation_ and will sure become wildly
> wrong in some cases. So the dirty_background_time implementation based
> on it will not always work to the user expectations.
> 
> One important case is, some users (eg. Dave Chinner) explicitly take
> advantage of the existing behavior to quickly create & delete a big
> 1GB temp file without worrying about triggering unnecessary IOs.

It's a fairly common use case - short-term temp files are used by
lots of applications and avoiding writing them - especially on NFS -
is a big performance win. Forcing immediate writeback will
definitely cause unpredictable changes in performance for many
people...

> > Results are:-
> > ==========================================================
> > Case:1 - Normal setup without any changes
> > ./performancetest_arm ./100MB write
> > 
> >  RecSize  WriteSpeed   RanWriteSpeed
> > 
> >  10485760  7.93MB/sec   8.11MB/sec
> >   1048576  8.21MB/sec   7.80MB/sec
> >    524288  8.71MB/sec   8.39MB/sec
> >    262144  8.91MB/sec   7.83MB/sec
> >    131072  8.91MB/sec   8.95MB/sec
> >     65536  8.95MB/sec   8.90MB/sec
> >     32768  8.76MB/sec   8.93MB/sec
> >     16384  8.78MB/sec   8.67MB/sec
> >      8192  8.90MB/sec   8.52MB/sec
> >      4096  8.89MB/sec   8.28MB/sec
> > 
> > Average speed is near 8MB/seconds.
> > 
> > Case:2 - Modified the dirty_background_time
> > ./performancetest_arm ./100MB write
> > 
> >  RecSize  WriteSpeed   RanWriteSpeed
> > 
> >  10485760  10.56MB/sec  10.37MB/sec
> >   1048576  10.43MB/sec  10.33MB/sec
> >    524288  10.32MB/sec  10.02MB/sec
> >    262144  10.52MB/sec  10.19MB/sec
> >    131072  10.34MB/sec  10.07MB/sec
> >     65536  10.31MB/sec  10.06MB/sec
> >     32768  10.27MB/sec  10.24MB/sec
> >     16384  10.54MB/sec  10.03MB/sec
> >      8192  10.41MB/sec  10.38MB/sec
> >      4096  10.34MB/sec  10.12MB/sec
> > 
> > we can see, average write speed is increased to ~10-11MB/sec.
> > ============================================================
> 
> The numbers are impressive!

All it shows is that avoiding the writeback delay writes a file a
bit faster, i.e. 5s delay + 10s @ 10MB/s vs no delay and 10s
@ 10MB/s. That's pretty obvious, really, and people have been trying
to make this "optimisation" for NFS clients for years in the
misguided belief that short-cutting writeback caching is beneficial
to application performance.

What these numbers don't show is whether over-the-wire
writeback speed has improved at all. Or what happens when you have a
network that is faster than the server disk, or even faster than the
client can write into memory. What about when there are multiple
threads, or the network is congested, or the server overloaded? In
those cases the performance differential will disappear, and
there's a good chance that the existing code will be significantly
faster because it places less immediate load on the server and
network...

If you need immediate dispatch of your data for single-threaded
performance, then sync_file_range() is your friend.
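
For reference, a minimal sketch of that kind of usage (the helper below is
illustrative; sync_file_range() and SYNC_FILE_RANGE_WRITE are the real
Linux API):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /*
     * Start writeback of the byte range we just wrote, without waiting
     * for completion and without the metadata flush a full fsync() implies.
     */
    static int start_range_writeback(int fd, off_t offset, off_t len)
    {
            return sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
    }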

> FYI, I tried another NFS specific approach
> to avoid big NFS COMMITs, which achieved similar performance gains:
> 
> nfs: writeback pages wait queue
> https://lkml.org/lkml/2011/10/20/235

Which is basically controlling the server IO latency when commits
occur - smaller ranges mean the commit (fsync) is faster, and more
frequent commits mean the data goes to disk sooner. This is
something that will have a positive impact on writeback speeds
because it modifies the NFS client writeback behaviour to be more
server-friendly and not stall over the wire. I.e. improving NFS
writeback performance is all about keeping the wire full and the
server happy, not about reducing the writeback delay before we start
writing over the wire.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance)
  2012-08-20  0:48   ` Namjae Jeon
@ 2012-08-20 14:50     ` Fengguang Wu
  2012-08-21  6:00       ` Namjae Jeon
  0 siblings, 1 reply; 11+ messages in thread
From: Fengguang Wu @ 2012-08-20 14:50 UTC (permalink / raw)
  To: Namjae Jeon
  Cc: akpm, linux-kernel, Namjae Jeon, linux-fsdevel, linux-nfs, Dave Chinner

On Mon, Aug 20, 2012 at 09:48:42AM +0900, Namjae Jeon wrote:
> 2012/8/19, Fengguang Wu <fengguang.wu@intel.com>:
> > On Sat, Aug 18, 2012 at 05:50:02AM -0400, Namjae Jeon wrote:
> >> From: Namjae Jeon <namjae.jeon@samsung.com>
> >>
> >> This patch is based on suggestion by Wu Fengguang:
> >> https://lkml.org/lkml/2011/8/19/19
> >>
> >> kernel has mechanism to do writeback as per dirty_ratio and
> >> dirty_background
> >> ratio. It also maintains per task dirty rate limit to keep balance of
> >> dirty pages at any given instance by doing bdi bandwidth estimation.
> >>
> >> Kernel also has max_ratio/min_ratio tunables to specify percentage of
> >> writecache
> >> to control per bdi dirty limits and task throtelling.
> >>
> >> However, there might be a usecase where user wants a writeback tuning
> >> parameter to flush dirty data at desired/tuned time interval.
> >>
> >> dirty_background_time provides an interface where user can tune
> >> background
> >> writeback start time using /sys/block/sda/bdi/dirty_background_time
> >>
> >> dirty_background_time is used alongwith average bdi write bandwidth
> >> estimation
> >> to start background writeback.
> >
> > Here lies my major concern about dirty_background_time: the write
> > bandwidth estimation is an _estimation_ and will sure become wildly
> > wrong in some cases. So the dirty_background_time implementation based
> > on it will not always work to the user expectations.
> >
> > One important case is, some users (eg. Dave Chinner) explicitly take
> > advantage of the existing behavior to quickly create & delete a big
> > 1GB temp file without worrying about triggering unnecessary IOs.
> >
> Hi. Wu.
> Okay, I have a question.
> 
> If making dirty_writeback_interval per bdi to tune short interval
> instead of background_time, We can get similar performance
> improvement.
> /sys/block/<device>/bdi/dirty_writeback_interval
> /sys/block/<device>/bdi/dirty_expire_interval
> 
> NFS write performance improvement is just one usecase.
> 
> If we can set interval/time per bdi,  other usecases will be created
> by applying.

Per-bdi interval/time tunables, if such a need arises, will essentially
be about data caching and safety. If we turn them into a requirement for
better performance, users will potentially be stretched choosing the
"right" value to balance data caching, safety and performance.
Hmm, not a comfortable prospect.

> >The numbers are impressive! FYI, I tried another NFS specific approach
> >to avoid big NFS COMMITs, which achieved similar performance gains:
> 
> >nfs: writeback pages wait queue
> >https://lkml.org/lkml/2011/10/20/235
> 
> Thanks.

The NFS write queue, on the other hand, is aimed directly at
improving NFS performance, latency and responsiveness.

In comparison to the per-bdi interval/time, it's more of a guarantee of
smoother NFS writes.  As the tests in the original email show, at
the cost of slightly more commits, it gains much better write
throughput and latency.

The NFS write queue is even a requirement if we want to get
reasonably good responsiveness. Without it, the 20% dirty limit may
well be filled by NFS writeback/unstable pages, which is very bad for
responsiveness. Let me quote the contents of two old emails (with small
fixes):

: PG_writeback pages have been the biggest source of
: latency issues in the various parts of the system.
: 
: It's not uncommon for me to see filesystems sleep on PG_writeback
: pages during heavy writeback, within some lock or transaction, which in
: turn stall many tasks that try to do IO or merely dirty some page in
: memory. Random writes are especially susceptible to such stalls. The
: stable page feature also vastly increases the chances of stalls by
: locking the writeback pages.
 
: When there are N seconds worth of writeback pages, it may
: take N/2 seconds on average for wait_on_page_writeback() to finish.
: So the total time cost of running into a random writeback page and
: waiting on it is also O(n^2):

:       E(PG_writeback waits) = P(hit PG_writeback) * E(wait on it)

: That means we can hardly keep more than 1-second worth of writeback
: pages w/o worrying about long waits on PG_writeback in various parts
: of the kernel.

: Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
: the case of direct reclaim, it means blocking random tasks that are
: allocating memory in the system.
: 
: PG_writeback pages are much worse than PG_dirty pages in that they are
: not movable. This makes a big difference for high-order page allocations.
: To make room for a 2MB huge page, vmscan has the option to migrate
: PG_dirty pages, but for PG_writeback it has no better choices than to
: wait for IO completion.
: 
: The difficulty of THP allocation goes up *exponentially* with the
: number of PG_writeback pages. Assume PG_writeback pages are randomly
: distributed in the physical memory space. Then we have formula
: 
:         P(reclaimable for THP) = P(non-PG_writeback)^512
: 
: That's the probability for a contiguous range of 512 pages to be free of
: PG_writeback, so that it's immediately reclaimable for use by
: transparent huge page. This ruby script shows us the concrete numbers.
: 
: irb> 1.upto(10) { |i| j=i/1000.0; printf "%.3f\t\t\t%.3f\n", j, (1-j)**512 }
: 
:         P(hit PG_writeback)     P(reclaimable for THP)
:         0.001                   0.599
:         0.002                   0.359
:         0.003                   0.215
:         0.004                   0.128
:         0.005                   0.077
:         0.006                   0.046
:         0.007                   0.027
:         0.008                   0.016
:         0.009                   0.010
:         0.010                   0.006
: 
: The numbers show that when the PG_writeback pages go up from 0.1% to
: 1% of system memory, the THP reclaim success ratio drops quickly from
: 60% to 0.6%. It indicates that in order to use THP without constantly
: running into stalls, the reasonable PG_writeback ratio is <= 0.1%.
: Going beyond that threshold, it quickly becomes intolerable.
: 
: That makes a limit of 256MB writeback pages for a mem=256GB system.
: Looking at the real vmstat:nr_writeback numbers in dd write tests:
: 
: JBOD-12SSD-thresh=8G/ext4-1dd-1-3.3.0/vmstat-end:nr_writeback 217009
: JBOD-12SSD-thresh=8G/ext4-10dd-1-3.3.0/vmstat-end:nr_writeback 198335
: JBOD-12SSD-thresh=8G/xfs-1dd-1-3.3.0/vmstat-end:nr_writeback 306026
: JBOD-12SSD-thresh=8G/xfs-10dd-1-3.3.0/vmstat-end:nr_writeback 315099
: JBOD-12SSD-thresh=8G/btrfs-1dd-1-3.3.0/vmstat-end:nr_writeback 1216058
: JBOD-12SSD-thresh=8G/btrfs-10dd-1-3.3.0/vmstat-end:nr_writeback 895335
: 
: Oops btrfs has 4GB writeback pages -- which asks for some bug fixing.
: Even ext4's 800MB still looks way too high, but that's ~1s worth of
: data per queue (or 130ms worth of data for the high performance Intel
: SSD, which is perhaps in danger of queue underruns?). So this system
: would require 512GB memory to comfortably run KVM instances with THP
: support.

The main concern about the NFS write wait queue, however, was that it
might hurt performance on long fat network pipes with large
bandwidth-delay products. If the pipe size can be properly estimated,
we'll be able to set an adequate queue size and remove the last obstacle
to that patch.
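
For illustration (hypothetical numbers, just to show the scale): the pipe
size is the bandwidth-delay product, e.g.

    1Gbps link * 100ms RTT ~ 125MB/s * 0.1s ~ 12.5MB

so the writeback wait queue would need to allow at least that much data in
flight to keep such a pipe full.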

Thanks,
Fengguang


* Re: [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance)
  2012-08-20  2:00   ` Dave Chinner
@ 2012-08-20 18:01     ` J. Bruce Fields
  2012-08-21  5:48       ` Namjae Jeon
  0 siblings, 1 reply; 11+ messages in thread
From: J. Bruce Fields @ 2012-08-20 18:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fengguang Wu, Namjae Jeon, akpm, linux-kernel, Namjae Jeon,
	linux-fsdevel, linux-nfs

On Mon, Aug 20, 2012 at 12:00:04PM +1000, Dave Chinner wrote:
> On Sun, Aug 19, 2012 at 10:57:24AM +0800, Fengguang Wu wrote:
> > On Sat, Aug 18, 2012 at 05:50:02AM -0400, Namjae Jeon wrote:
> > > From: Namjae Jeon <namjae.jeon@samsung.com>
> > > 
> > > This patch is based on suggestion by Wu Fengguang:
> > > https://lkml.org/lkml/2011/8/19/19
> > > 
> > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> > > ratio. It also maintains per task dirty rate limit to keep balance of
> > > dirty pages at any given instance by doing bdi bandwidth estimation.
> > > 
> > > Kernel also has max_ratio/min_ratio tunables to specify percentage of writecache
> > > to control per bdi dirty limits and task throtelling.
> > > 
> > > However, there might be a usecase where user wants a writeback tuning
> > > parameter to flush dirty data at desired/tuned time interval.
> > > 
> > > dirty_background_time provides an interface where user can tune background
> > > writeback start time using /sys/block/sda/bdi/dirty_background_time
> > > 
> > > dirty_background_time is used alongwith average bdi write bandwidth estimation
> > > to start background writeback.
> > 
> > Here lies my major concern about dirty_background_time: the write
> > bandwidth estimation is an _estimation_ and will sure become wildly
> > wrong in some cases. So the dirty_background_time implementation based
> > on it will not always work to the user expectations.
> > 
> > One important case is, some users (eg. Dave Chinner) explicitly take
> > advantage of the existing behavior to quickly create & delete a big
> > 1GB temp file without worrying about triggering unnecessary IOs.
> 
> It's a fairly common use case - short term temp files are used by
> lots of applications and avoiding writing them - especially on NFS -
> is a big performance win. Forcing immediate writeback will
> definitely cause unprdictable changes in performance for many
> people...
> 
> > > Results are:-
> > > ==========================================================
> > > Case:1 - Normal setup without any changes
> > > ./performancetest_arm ./100MB write
> > > 
> > >  RecSize  WriteSpeed   RanWriteSpeed
> > > 
> > >  10485760  7.93MB/sec   8.11MB/sec
> > >   1048576  8.21MB/sec   7.80MB/sec
> > >    524288  8.71MB/sec   8.39MB/sec
> > >    262144  8.91MB/sec   7.83MB/sec
> > >    131072  8.91MB/sec   8.95MB/sec
> > >     65536  8.95MB/sec   8.90MB/sec
> > >     32768  8.76MB/sec   8.93MB/sec
> > >     16384  8.78MB/sec   8.67MB/sec
> > >      8192  8.90MB/sec   8.52MB/sec
> > >      4096  8.89MB/sec   8.28MB/sec
> > > 
> > > Average speed is near 8MB/seconds.
> > > 
> > > Case:2 - Modified the dirty_background_time
> > > ./performancetest_arm ./100MB write
> > > 
> > >  RecSize  WriteSpeed   RanWriteSpeed
> > > 
> > >  10485760  10.56MB/sec  10.37MB/sec
> > >   1048576  10.43MB/sec  10.33MB/sec
> > >    524288  10.32MB/sec  10.02MB/sec
> > >    262144  10.52MB/sec  10.19MB/sec
> > >    131072  10.34MB/sec  10.07MB/sec
> > >     65536  10.31MB/sec  10.06MB/sec
> > >     32768  10.27MB/sec  10.24MB/sec
> > >     16384  10.54MB/sec  10.03MB/sec
> > >      8192  10.41MB/sec  10.38MB/sec
> > >      4096  10.34MB/sec  10.12MB/sec
> > > 
> > > we can see, average write speed is increased to ~10-11MB/sec.
> > > ============================================================
> > 
> > The numbers are impressive!
> 
> All it shows is that avoiding the writeback delay writes a file a
> bit faster. i.e. 5s delay + 10s @ 10MB/s vs no delay and 10s
> @10MB/s. That's pretty obvious, really, and people have been trying
> to make this "optimisation" for NFS clients for years in the
> misguided belief that short-cutting writeback caching is beneficial
> to application performance.
> 
> What these numbers don't show that is whether over-the-wire
> writeback speed has improved at all. Or what happens when you have a
> network that is faster than the server disk, or even faster than the
> client can write into memory? What about when there are multiple
> threads, or the network is congested, or the server overloaded? In
> those cases the performance differential will disappear and
> there's a good chance that the existing code will be significantly
> faster because it places less imediate load on the server and
> network.D...
> 
> If you need immediate dispatch of your data for single threaded
> performance then sync_file_range() is your friend.
> 
> > FYI, I tried another NFS specific approach
> > to avoid big NFS COMMITs, which achieved similar performance gains:
> > 
> > nfs: writeback pages wait queue
> > https://lkml.org/lkml/2011/10/20/235
> 
> Which is basically controlling the server IO latency when commits
> occur - smaller ranges mean the commit (fsync) is faster, and more
> frequent commits mean the data goes to disk sooner. This is
> something that will have a positive impact on writeback speeds
> because it modifies the NFs client writeback behaviour to be more
> server friendly and not stall over the wire. i.e. improving NFS
> writeback performance is all about keeping the wire full and the
> server happy, not about reducing the writeback delay before we start
> writing over the wire.

Wait, aren't we confusing client and server side here?

If I read Namjae Jeon's post correctly, I understood that it was the
*server* side he was modifying to start writeout sooner, to improve
response time to eventual expected commits from the client.  The
responses above all seem to be about the client.

Maybe it's all the same at some level, but: naively, starting writeout
early would seem a better bet on the server side.  By the time we get
writes, the client has already decided they're worth sending to disk.

And changes to make clients and applications friendlier to the server
are great, but we don't always have that option--there are more clients
out there than servers and the latter may be easier to upgrade than the
former.

--b.


* Re: [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance)
  2012-08-20 18:01     ` J. Bruce Fields
@ 2012-08-21  5:48       ` Namjae Jeon
  2012-08-21 12:57         ` Fengguang Wu
  0 siblings, 1 reply; 11+ messages in thread
From: Namjae Jeon @ 2012-08-21  5:48 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Dave Chinner, Fengguang Wu, akpm, linux-kernel, Namjae Jeon,
	linux-fsdevel, linux-nfs

2012/8/21, J. Bruce Fields <bfields@fieldses.org>:
> On Mon, Aug 20, 2012 at 12:00:04PM +1000, Dave Chinner wrote:
>> On Sun, Aug 19, 2012 at 10:57:24AM +0800, Fengguang Wu wrote:
>> > On Sat, Aug 18, 2012 at 05:50:02AM -0400, Namjae Jeon wrote:
>> > > From: Namjae Jeon <namjae.jeon@samsung.com>
>> > >
>> > > This patch is based on suggestion by Wu Fengguang:
>> > > https://lkml.org/lkml/2011/8/19/19
>> > >
>> > > kernel has mechanism to do writeback as per dirty_ratio and
>> > > dirty_background
>> > > ratio. It also maintains per task dirty rate limit to keep balance of
>> > > dirty pages at any given instance by doing bdi bandwidth estimation.
>> > >
>> > > Kernel also has max_ratio/min_ratio tunables to specify percentage of
>> > > writecache
>> > > to control per bdi dirty limits and task throtelling.
>> > >
>> > > However, there might be a usecase where user wants a writeback tuning
>> > > parameter to flush dirty data at desired/tuned time interval.
>> > >
>> > > dirty_background_time provides an interface where user can tune
>> > > background
>> > > writeback start time using /sys/block/sda/bdi/dirty_background_time
>> > >
>> > > dirty_background_time is used alongwith average bdi write bandwidth
>> > > estimation
>> > > to start background writeback.
>> >
>> > Here lies my major concern about dirty_background_time: the write
>> > bandwidth estimation is an _estimation_ and will sure become wildly
>> > wrong in some cases. So the dirty_background_time implementation based
>> > on it will not always work to the user expectations.
>> >
>> > One important case is, some users (eg. Dave Chinner) explicitly take
>> > advantage of the existing behavior to quickly create & delete a big
>> > 1GB temp file without worrying about triggering unnecessary IOs.
>>
>> It's a fairly common use case - short term temp files are used by
>> lots of applications and avoiding writing them - especially on NFS -
>> is a big performance win. Forcing immediate writeback will
>> definitely cause unprdictable changes in performance for many
>> people...
>>
>> > > Results are:-
>> > > ==========================================================
>> > > Case:1 - Normal setup without any changes
>> > > ./performancetest_arm ./100MB write
>> > >
>> > >  RecSize  WriteSpeed   RanWriteSpeed
>> > >
>> > >  10485760  7.93MB/sec   8.11MB/sec
>> > >   1048576  8.21MB/sec   7.80MB/sec
>> > >    524288  8.71MB/sec   8.39MB/sec
>> > >    262144  8.91MB/sec   7.83MB/sec
>> > >    131072  8.91MB/sec   8.95MB/sec
>> > >     65536  8.95MB/sec   8.90MB/sec
>> > >     32768  8.76MB/sec   8.93MB/sec
>> > >     16384  8.78MB/sec   8.67MB/sec
>> > >      8192  8.90MB/sec   8.52MB/sec
>> > >      4096  8.89MB/sec   8.28MB/sec
>> > >
>> > > Average speed is near 8MB/seconds.
>> > >
>> > > Case:2 - Modified the dirty_background_time
>> > > ./performancetest_arm ./100MB write
>> > >
>> > >  RecSize  WriteSpeed   RanWriteSpeed
>> > >
>> > >  10485760  10.56MB/sec  10.37MB/sec
>> > >   1048576  10.43MB/sec  10.33MB/sec
>> > >    524288  10.32MB/sec  10.02MB/sec
>> > >    262144  10.52MB/sec  10.19MB/sec
>> > >    131072  10.34MB/sec  10.07MB/sec
>> > >     65536  10.31MB/sec  10.06MB/sec
>> > >     32768  10.27MB/sec  10.24MB/sec
>> > >     16384  10.54MB/sec  10.03MB/sec
>> > >      8192  10.41MB/sec  10.38MB/sec
>> > >      4096  10.34MB/sec  10.12MB/sec
>> > >
>> > > we can see, average write speed is increased to ~10-11MB/sec.
>> > > ============================================================
>> >
>> > The numbers are impressive!
>>
>> All it shows is that avoiding the writeback delay writes a file a
>> bit faster. i.e. 5s delay + 10s @ 10MB/s vs no delay and 10s
>> @10MB/s. That's pretty obvious, really, and people have been trying
>> to make this "optimisation" for NFS clients for years in the
>> misguided belief that short-cutting writeback caching is beneficial
>> to application performance.
>>
>> What these numbers don't show that is whether over-the-wire
>> writeback speed has improved at all. Or what happens when you have a
>> network that is faster than the server disk, or even faster than the
>> client can write into memory? What about when there are multiple
>> threads, or the network is congested, or the server overloaded? In
>> those cases the performance differential will disappear and
>> there's a good chance that the existing code will be significantly
>> faster because it places less imediate load on the server and
>> network.D...
>>
>> If you need immediate dispatch of your data for single threaded
>> performance then sync_file_range() is your friend.
>>
>> > FYI, I tried another NFS specific approach
>> > to avoid big NFS COMMITs, which achieved similar performance gains:
>> >
>> > nfs: writeback pages wait queue
>> > https://lkml.org/lkml/2011/10/20/235
>>
>> Which is basically controlling the server IO latency when commits
>> occur - smaller ranges mean the commit (fsync) is faster, and more
>> frequent commits mean the data goes to disk sooner. This is
>> something that will have a positive impact on writeback speeds
>> because it modifies the NFs client writeback behaviour to be more
>> server friendly and not stall over the wire. i.e. improving NFS
>> writeback performance is all about keeping the wire full and the
>> server happy, not about reducing the writeback delay before we start
>> writing over the wire.
>
> Wait, aren't we confusing client and server side here?
>
> If I read Namjae Jeon's post correctly, I understood that it was the
> *server* side he was modifying to start writeout sooner, to improve
> response time to eventual expected commits from the client.  The
> responses above all seem to be about the client.
>
> Maybe it's all the same at some level, but: naively, starting writeout
> early would seem a better bet on the server side.  By the time we get
> writes, the client has already decided they're worth sending to disk.
Hi Bruce.

Yes, right. I have not changed the writeback settings on the NFS client;
they were changed on the NFS server. Writeback behaviour on the NFS client
stays at the defaults, so there is no change in data caching behaviour
on the NFS client. The change reduces the server-side wait time for the
NFS COMMIT by starting writeback early.
>
> And changes to make clients and applications friendlier to the server
> are great, but we don't always have that option--there are more clients
> out there than servers and the latter may be easier to upgrade than the
> former.
I agree with your opinion.

Thanks.
>
> --b.
>


* Re: [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance)
  2012-08-20 14:50     ` Fengguang Wu
@ 2012-08-21  6:00       ` Namjae Jeon
  2012-08-21 13:04         ` Fengguang Wu
  0 siblings, 1 reply; 11+ messages in thread
From: Namjae Jeon @ 2012-08-21  6:00 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: akpm, linux-kernel, Namjae Jeon, linux-fsdevel, linux-nfs, Dave Chinner

2012/8/20, Fengguang Wu <fengguang.wu@intel.com>:
> On Mon, Aug 20, 2012 at 09:48:42AM +0900, Namjae Jeon wrote:
>> 2012/8/19, Fengguang Wu <fengguang.wu@intel.com>:
>> > On Sat, Aug 18, 2012 at 05:50:02AM -0400, Namjae Jeon wrote:
>> >> From: Namjae Jeon <namjae.jeon@samsung.com>
>> >>
>> >> This patch is based on suggestion by Wu Fengguang:
>> >> https://lkml.org/lkml/2011/8/19/19
>> >>
>> >> kernel has mechanism to do writeback as per dirty_ratio and
>> >> dirty_background
>> >> ratio. It also maintains per task dirty rate limit to keep balance of
>> >> dirty pages at any given instance by doing bdi bandwidth estimation.
>> >>
>> >> Kernel also has max_ratio/min_ratio tunables to specify percentage of
>> >> writecache
>> >> to control per bdi dirty limits and task throtelling.
>> >>
>> >> However, there might be a usecase where user wants a writeback tuning
>> >> parameter to flush dirty data at desired/tuned time interval.
>> >>
>> >> dirty_background_time provides an interface where user can tune
>> >> background
>> >> writeback start time using /sys/block/sda/bdi/dirty_background_time
>> >>
>> >> dirty_background_time is used alongwith average bdi write bandwidth
>> >> estimation
>> >> to start background writeback.
>> >
>> > Here lies my major concern about dirty_background_time: the write
>> > bandwidth estimation is an _estimation_ and will sure become wildly
>> > wrong in some cases. So the dirty_background_time implementation based
>> > on it will not always work to the user expectations.
>> >
>> > One important case is, some users (eg. Dave Chinner) explicitly take
>> > advantage of the existing behavior to quickly create & delete a big
>> > 1GB temp file without worrying about triggering unnecessary IOs.
>> >
>> Hi. Wu.
>> Okay, I have a question.
>>
>> If making dirty_writeback_interval per bdi to tune short interval
>> instead of background_time, We can get similar performance
>> improvement.
>> /sys/block/<device>/bdi/dirty_writeback_interval
>> /sys/block/<device>/bdi/dirty_expire_interval
>>
>> NFS write performance improvement is just one usecase.
>>
>> If we can set interval/time per bdi,  other usecases will be created
>> by applying.
>
> Per-bdi interval/time tunables, if there comes such a need, will in
> essential be for data caching and safety. If turning them into some
> requirement for better performance, the users will potential be
> stretched on choosing the "right" value for balanced data cache,
> safety and performance.  Hmm, not a comfortable prospection.
Hi Wu,
First, thanks for the shared information.

I change the writeback interval on the NFS server only.

I think this does not affect the data cache/page (caching) behaviour on
the NFS client. The NFS client will start sending write requests as per
the default NFS/writeback logic, so there is no change in NFS client data
caching behaviour.

Also, on the NFS server it does not change the system-wide caching
behaviour. It only modifies the caching/writeback behaviour of a
particular "bdi" on the NFS server, so that the NFS client can see better
WRITE speed.

I will share more performance test results along the lines of Dave's comments.

>
>> >The numbers are impressive! FYI, I tried another NFS specific approach
>> >to avoid big NFS COMMITs, which achieved similar performance gains:
>>
>> >nfs: writeback pages wait queue
>> >https://lkml.org/lkml/2011/10/20/235
This patch looks like a client-side optimization to me (need to check more).
Do we need a server-side optimization as well, per Bruce's opinion?

Thanks.
>>
>> Thanks.
>
> The NFS write queue, on the other hand, is directly aimed for
> improving NFS performance, latency and responsiveness.
>
> In comparison to the per-bdi interval/time, it's more a guarantee of
> smoother NFS writes.  As the tests show in the original email, with
> the cost of a little more commits, it gains much better write
> throughput and latency.
>
> The NFS write queue is even a requirement, if we want to get
> reasonable good responsiveness. Without it, the 20% dirty limit may
> well be filled by NFS writeback/unstable pages. This is very bad for
> responsiveness. Let me quote contents of two old emails (with small
> fixes):
>
> : PG_writeback pages have been the biggest source of
> : latency issues in the various parts of the system.
> :
> : It's not uncommon for me to see filesystems sleep on PG_writeback
> : pages during heavy writeback, within some lock or transaction, which in
> : turn stall many tasks that try to do IO or merely dirty some page in
> : memory. Random writes are especially susceptible to such stalls. The
> : stable page feature also vastly increase the chances of stalls by
> : locking the writeback pages.
>
> : When there are N seconds worth of writeback pages, it may
> : take N/2 seconds on average for wait_on_page_writeback() to finish.
> : So the total time cost of running into a random writeback page and
> : waiting on it is also O(n^2):
>
> :       E(PG_writeback waits) = P(hit PG_writeback) * E(wait on it)
>
> : That means we can hardly keep more than 1-second worth of writeback
> : pages w/o worrying about long waits on PG_writeback in various parts
> : of the kernel.
>
> : Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
> : the case of direct reclaim, it means blocking random tasks that are
> : allocating memory in the system.
> :
> : PG_writeback pages are much worse than PG_dirty pages in that they are
> : not movable. This makes a big difference for high-order page allocations.
> : To make room for a 2MB huge page, vmscan has the option to migrate
> : PG_dirty pages, but for PG_writeback it has no better choices than to
> : wait for IO completion.
> :
> : The difficulty of THP allocation goes up *exponentially* with the
> : number of PG_writeback pages. Assume PG_writeback pages are randomly
> : distributed in the physical memory space. Then we have formula
> :
> :         P(reclaimable for THP) = P(non-PG_writeback)^512
> :
> : That's the possibly for a contiguous range of 512 pages to be free of
> : PG_writeback, so that it's immediately reclaimable for use by
> : transparent huge page. This ruby script shows us the concrete numbers.
> :
> : irb> 1.upto(10) { |i| j=i/1000.0; printf "%.3f\t\t\t%.3f\n", j, (1-j)**512
> }
> :
> :         P(hit PG_writeback)     P(reclaimable for THP)
> :         0.001                   0.599
> :         0.002                   0.359
> :         0.003                   0.215
> :         0.004                   0.128
> :         0.005                   0.077
> :         0.006                   0.046
> :         0.007                   0.027
> :         0.008                   0.016
> :         0.009                   0.010
> :         0.010                   0.006
> :
> : The numbers show that when the PG_writeback pages go up from 0.1% to
> : 1% of system memory, the THP reclaim success ratio drops quickly from
> : 60% to 0.6%. It indicates that in order to use THP without constantly
> : running into stalls, the reasonable PG_writeback ratio is <= 0.1%.
> : Going beyond that threshold, it quickly becomes intolerable.
> :
> : That makes a limit of 256MB writeback pages for a mem=256GB system.
> : Looking at the real vmstat:nr_writeback numbers in dd write tests:
> :
> : JBOD-12SSD-thresh=8G/ext4-1dd-1-3.3.0/vmstat-end:nr_writeback 217009
> : JBOD-12SSD-thresh=8G/ext4-10dd-1-3.3.0/vmstat-end:nr_writeback 198335
> : JBOD-12SSD-thresh=8G/xfs-1dd-1-3.3.0/vmstat-end:nr_writeback 306026
> : JBOD-12SSD-thresh=8G/xfs-10dd-1-3.3.0/vmstat-end:nr_writeback 315099
> : JBOD-12SSD-thresh=8G/btrfs-1dd-1-3.3.0/vmstat-end:nr_writeback 1216058
> : JBOD-12SSD-thresh=8G/btrfs-10dd-1-3.3.0/vmstat-end:nr_writeback 895335
> :
> : Oops btrfs has 4GB writeback pages -- which asks for some bug fixing.
> : Even ext4's 800MB still looks way too high, but that's ~1s worth of
> : data per queue (or 130ms worth of data for the high performance Intel
> : SSD, which is perhaps in danger of queue underruns?). So this system
> : would require 512GB memory to comfortably run KVM instances with THP
> : support.
>
> The main concern about the NFS write wait queue, however, was that it
> might hurt performance for long fat network pipes with large
> bandwidth-delay products. If the pipe size can be properly estimated,
> we'll be able to set an adequate queue size and remove the last
> obstacle to that patch.
>
> Thanks,
> Fengguang
>
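
The quoted (1 - P)^512 table can be reproduced outside irb as well; a
minimal C sketch of the same arithmetic (purely illustrative, not part
of any patch) is:

	#include <math.h>
	#include <stdio.h>

	/* P(reclaimable for THP) = (1 - p)^512, p = PG_writeback fraction */
	int main(void)
	{
		int i;

		printf("P(hit PG_writeback)\tP(reclaimable for THP)\n");
		for (i = 1; i <= 10; i++) {
			double p = i / 1000.0;
			printf("%.3f\t\t\t%.3f\n", p, pow(1.0 - p, 512));
		}
		return 0;
	}

Build with "gcc thp_prob.c -lm" (the file name is arbitrary); the
output matches the table quoted above.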

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance)
  2012-08-21  5:48       ` Namjae Jeon
@ 2012-08-21 12:57         ` Fengguang Wu
  0 siblings, 0 replies; 11+ messages in thread
From: Fengguang Wu @ 2012-08-21 12:57 UTC (permalink / raw)
  To: Namjae Jeon
  Cc: J. Bruce Fields, Dave Chinner, akpm, linux-kernel, Namjae Jeon,
	linux-fsdevel, linux-nfs

On Tue, Aug 21, 2012 at 02:48:35PM +0900, Namjae Jeon wrote:
> 2012/8/21, J. Bruce Fields <bfields@fieldses.org>:
> > On Mon, Aug 20, 2012 at 12:00:04PM +1000, Dave Chinner wrote:
> >> On Sun, Aug 19, 2012 at 10:57:24AM +0800, Fengguang Wu wrote:
> >> > On Sat, Aug 18, 2012 at 05:50:02AM -0400, Namjae Jeon wrote:
> >> > > From: Namjae Jeon <namjae.jeon@samsung.com>
> >> > >
> >> > > This patch is based on suggestion by Wu Fengguang:
> >> > > https://lkml.org/lkml/2011/8/19/19
> >> > >
> >> > > kernel has mechanism to do writeback as per dirty_ratio and
> >> > > dirty_background
> >> > > ratio. It also maintains per task dirty rate limit to keep balance of
> >> > > dirty pages at any given instance by doing bdi bandwidth estimation.
> >> > >
> >> > > Kernel also has max_ratio/min_ratio tunables to specify percentage of
> >> > > writecache
> >> > > to control per bdi dirty limits and task throtelling.
> >> > >
> >> > > However, there might be a usecase where user wants a writeback tuning
> >> > > parameter to flush dirty data at desired/tuned time interval.
> >> > >
> >> > > dirty_background_time provides an interface where user can tune
> >> > > background
> >> > > writeback start time using /sys/block/sda/bdi/dirty_background_time
> >> > >
> >> > > dirty_background_time is used alongwith average bdi write bandwidth
> >> > > estimation
> >> > > to start background writeback.
> >> >
> >> > Here lies my major concern about dirty_background_time: the write
> >> > bandwidth estimation is an _estimation_ and will sure become wildly
> >> > wrong in some cases. So the dirty_background_time implementation based
> >> > on it will not always work to the user expectations.
> >> >
> >> > One important case is, some users (eg. Dave Chinner) explicitly take
> >> > advantage of the existing behavior to quickly create & delete a big
> >> > 1GB temp file without worrying about triggering unnecessary IOs.
> >>
> >> It's a fairly common use case - short-term temp files are used by
> >> lots of applications and avoiding writing them - especially on NFS -
> >> is a big performance win. Forcing immediate writeback will
> >> definitely cause unpredictable changes in performance for many
> >> people...
> >>
> >> > > Results are:-
> >> > > ==========================================================
> >> > > Case:1 - Normal setup without any changes
> >> > > ./performancetest_arm ./100MB write
> >> > >
> >> > >  RecSize  WriteSpeed   RanWriteSpeed
> >> > >
> >> > >  10485760  7.93MB/sec   8.11MB/sec
> >> > >   1048576  8.21MB/sec   7.80MB/sec
> >> > >    524288  8.71MB/sec   8.39MB/sec
> >> > >    262144  8.91MB/sec   7.83MB/sec
> >> > >    131072  8.91MB/sec   8.95MB/sec
> >> > >     65536  8.95MB/sec   8.90MB/sec
> >> > >     32768  8.76MB/sec   8.93MB/sec
> >> > >     16384  8.78MB/sec   8.67MB/sec
> >> > >      8192  8.90MB/sec   8.52MB/sec
> >> > >      4096  8.89MB/sec   8.28MB/sec
> >> > >
> >> > > Average speed is near 8MB/seconds.
> >> > >
> >> > > Case:2 - Modified the dirty_background_time
> >> > > ./performancetest_arm ./100MB write
> >> > >
> >> > >  RecSize  WriteSpeed   RanWriteSpeed
> >> > >
> >> > >  10485760  10.56MB/sec  10.37MB/sec
> >> > >   1048576  10.43MB/sec  10.33MB/sec
> >> > >    524288  10.32MB/sec  10.02MB/sec
> >> > >    262144  10.52MB/sec  10.19MB/sec
> >> > >    131072  10.34MB/sec  10.07MB/sec
> >> > >     65536  10.31MB/sec  10.06MB/sec
> >> > >     32768  10.27MB/sec  10.24MB/sec
> >> > >     16384  10.54MB/sec  10.03MB/sec
> >> > >      8192  10.41MB/sec  10.38MB/sec
> >> > >      4096  10.34MB/sec  10.12MB/sec
> >> > >
> >> > > we can see, average write speed is increased to ~10-11MB/sec.
> >> > > ============================================================
> >> >
> >> > The numbers are impressive!
> >>
> >> All it shows is that avoiding the writeback delay writes a file a
> >> bit faster. i.e. 5s delay + 10s @ 10MB/s vs no delay and 10s
> >> @10MB/s. That's pretty obvious, really, and people have been trying
> >> to make this "optimisation" for NFS clients for years in the
> >> misguided belief that short-cutting writeback caching is beneficial
> >> to application performance.
> >>
> >> What these numbers don't show is whether over-the-wire
> >> writeback speed has improved at all. Or what happens when you have a
> >> network that is faster than the server disk, or even faster than the
> >> client can write into memory? What about when there are multiple
> >> threads, or the network is congested, or the server overloaded? In
> >> those cases the performance differential will disappear and
> >> there's a good chance that the existing code will be significantly
> >> faster because it places less immediate load on the server and
> >> network...
> >>
> >> If you need immediate dispatch of your data for single-threaded
> >> performance then sync_file_range() is your friend.
> >>
> >> > FYI, I tried another NFS specific approach
> >> > to avoid big NFS COMMITs, which achieved similar performance gains:
> >> >
> >> > nfs: writeback pages wait queue
> >> > https://lkml.org/lkml/2011/10/20/235
> >>
> >> Which is basically controlling the server IO latency when commits
> >> occur - smaller ranges mean the commit (fsync) is faster, and more
> >> frequent commits mean the data goes to disk sooner. This is
> >> something that will have a positive impact on writeback speeds
> >> because it modifies the NFS client writeback behaviour to be more
> >> server-friendly and not stall over the wire. i.e. improving NFS
> >> writeback performance is all about keeping the wire full and the
> >> server happy, not about reducing the writeback delay before we start
> >> writing over the wire.
> >
> > Wait, aren't we confusing client and server side here?
> >
> > If I read Namjae Jeon's post correctly, I understood that it was the
> > *server* side he was modifying to start writeout sooner, to improve
> > response time to eventual expected commits from the client.  The
> > responses above all seem to be about the client.
> >
> > Maybe it's all the same at some level, but: naively, starting writeout
> > early would seem a better bet on the server side.  By the time we get
> > writes, the client has already decided they're worth sending to disk.
> Hi Bruce.
> 
> Yes, right. I have not changed the writeback settings on the NFS
> client; they were changed on the NFS server.

Ah OK, I'm very supportive of lowering the NFS server's background
writeback threshold. This will obviously help reduce disk idle time, as
well as turn a good number of SYNC writes into ASYNC ones.

> So writeback behaviour on the NFS client will work at its defaults,
> and there will be no change in data caching behaviour on the NFS
> client. It will reduce the server-side wait time for NFS COMMIT by
> starting writeback early.

Agreed.

> >
> > And changes to make clients and applications friendlier to the server
> > are great, but we don't always have that option--there are more clients
> > out there than servers and the latter may be easier to upgrade than the
> > former.
> I agree with your opinion.

Agreed.

Thanks,
Fengguang
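
To make the sync_file_range() suggestion quoted above concrete, a
minimal userspace sketch (illustrative only; the path, chunk size and
total size are made-up values) could look like this -- it starts
asynchronous writeback of each chunk right after writing it, instead of
waiting for the dirty thresholds discussed in this thread:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		const size_t chunk = 4 << 20;	/* 4MB per write */
		char *buf = calloc(1, chunk);
		int fd = open("/mnt/nfs/testfile",
			      O_WRONLY | O_CREAT | O_TRUNC, 0644);
		int i;

		if (!buf || fd < 0) {
			perror("setup");
			return 1;
		}
		for (i = 0; i < 25; i++) {	/* ~100MB total */
			off_t off = (off_t)i * chunk;

			if (pwrite(fd, buf, chunk, off) != (ssize_t)chunk) {
				perror("pwrite");
				return 1;
			}
			/* kick off writeback of the just-written range */
			sync_file_range(fd, off, chunk, SYNC_FILE_RANGE_WRITE);
		}
		close(fd);
		free(buf);
		return 0;
	}

Note this only queues writeback; it is not a data-integrity barrier.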

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance)
  2012-08-21  6:00       ` Namjae Jeon
@ 2012-08-21 13:04         ` Fengguang Wu
  2012-08-22  1:10           ` Namjae Jeon
  0 siblings, 1 reply; 11+ messages in thread
From: Fengguang Wu @ 2012-08-21 13:04 UTC (permalink / raw)
  To: Namjae Jeon
  Cc: akpm, linux-kernel, Namjae Jeon, linux-fsdevel, linux-nfs, Dave Chinner

On Tue, Aug 21, 2012 at 03:00:13PM +0900, Namjae Jeon wrote:
> 2012/8/20, Fengguang Wu <fengguang.wu@intel.com>:
> > On Mon, Aug 20, 2012 at 09:48:42AM +0900, Namjae Jeon wrote:
> >> 2012/8/19, Fengguang Wu <fengguang.wu@intel.com>:
> >> > On Sat, Aug 18, 2012 at 05:50:02AM -0400, Namjae Jeon wrote:
> >> >> From: Namjae Jeon <namjae.jeon@samsung.com>
> >> >>
> >> >> This patch is based on suggestion by Wu Fengguang:
> >> >> https://lkml.org/lkml/2011/8/19/19
> >> >>
> >> >> kernel has mechanism to do writeback as per dirty_ratio and
> >> >> dirty_background
> >> >> ratio. It also maintains per task dirty rate limit to keep balance of
> >> >> dirty pages at any given instance by doing bdi bandwidth estimation.
> >> >>
> >> >> Kernel also has max_ratio/min_ratio tunables to specify percentage of
> >> >> writecache
> >> >> to control per bdi dirty limits and task throtelling.
> >> >>
> >> >> However, there might be a usecase where user wants a writeback tuning
> >> >> parameter to flush dirty data at desired/tuned time interval.
> >> >>
> >> >> dirty_background_time provides an interface where user can tune
> >> >> background
> >> >> writeback start time using /sys/block/sda/bdi/dirty_background_time
> >> >>
> >> >> dirty_background_time is used alongwith average bdi write bandwidth
> >> >> estimation
> >> >> to start background writeback.
> >> >
> >> > Here lies my major concern about dirty_background_time: the write
> >> > bandwidth estimation is an _estimation_ and will sure become wildly
> >> > wrong in some cases. So the dirty_background_time implementation based
> >> > on it will not always work to the user expectations.
> >> >
> >> > One important case is, some users (eg. Dave Chinner) explicitly take
> >> > advantage of the existing behavior to quickly create & delete a big
> >> > 1GB temp file without worrying about triggering unnecessary IOs.
> >> >
> >> Hi. Wu.
> >> Okay, I have a question.
> >>
> >> If we make dirty_writeback_interval per-bdi and tune a short
> >> interval instead of background_time, we can get a similar
> >> performance improvement.
> >> /sys/block/<device>/bdi/dirty_writeback_interval
> >> /sys/block/<device>/bdi/dirty_expire_interval
> >>
> >> NFS write performance improvement is just one use case.
> >>
> >> If we can set the interval/time per bdi, other use cases can be
> >> built on top of it.
> >
> > Per-bdi interval/time tunables, if such a need arises, will in
> > essence be for data caching and safety. If we turn them into a
> > requirement for better performance, users will potentially be
> > stretched over choosing the "right" value to balance data caching,
> > safety and performance.  Hmm, not a comfortable prospect.
> Hi Wu.
> First, thanks for the shared information.
> 
> I changed the writeback interval on the NFS server only.

OK..sorry for missing that part!

> I think this does not change data/page caching behaviour on the NFS
> client. The NFS client will start sending write requests as per the
> default NFS/writeback logic, so there is no change in NFS client data
> caching behaviour.
> 
> Also, on the NFS server it does not change system-wide caching
> behaviour. It only modifies the caching/writeback behaviour of a
> particular “bdi” on the NFS server, so that the NFS client can see
> better WRITE speed.

But would you default to dirty_background_time=0, where the special
value 0 means no change from the original behavior? That will address
David's very reasonable concern. Otherwise quite a few users are going
to be surprised by the new behavior after upgrading the kernel.

> I will share several performance test results, as per Dave's suggestion.
> 
> >
> >> >The numbers are impressive! FYI, I tried another NFS specific approach
> >> >to avoid big NFS COMMITs, which achieved similar performance gains:
> >>
> >> >nfs: writeback pages wait queue
> >> >https://lkml.org/lkml/2011/10/20/235
> This patch looks like a client-side optimization to me (I need to check more).

Yes.

> Do we also need the server-side optimization, as per Bruce's opinion?

Sure.

Thanks,
Fengguang
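
For readers following along, a minimal sketch of the idea being
discussed here -- translate a per-bdi "background time" into a
background threshold via the bdi's estimated write bandwidth, with 0
meaning "keep the default behavior" -- is shown below. The names and
units (milliseconds, pages per second) are illustrative assumptions,
not the actual patch code:

	#include <stdio.h>

	static unsigned long background_thresh_pages(
			unsigned long default_thresh,	/* pages */
			unsigned long avg_write_bw,	/* pages/sec (estimated) */
			unsigned long background_time_ms)
	{
		unsigned long pages;

		if (background_time_ms == 0)	/* 0 == keep default behavior */
			return default_thresh;

		pages = avg_write_bw * background_time_ms / 1000;
		/* only ever lower the threshold, never raise it */
		return pages < default_thresh ? pages : default_thresh;
	}

	int main(void)
	{
		/* ~25MB/s USB disk => ~6400 pages/s with 4KB pages */
		unsigned long bw = 6400, def = 51200;	/* def ~= 200MB */

		printf("time=0ms    -> %lu pages (default kept)\n",
		       background_thresh_pages(def, bw, 0));
		printf("time=1000ms -> %lu pages (~1s of writeback)\n",
		       background_thresh_pages(def, bw, 1000));
		return 0;
	}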

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance)
  2012-08-21 13:04         ` Fengguang Wu
@ 2012-08-22  1:10           ` Namjae Jeon
  0 siblings, 0 replies; 11+ messages in thread
From: Namjae Jeon @ 2012-08-22  1:10 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: akpm, linux-kernel, Namjae Jeon, linux-fsdevel, linux-nfs, Dave Chinner

2012/8/21, Fengguang Wu <fengguang.wu@intel.com>:
> On Tue, Aug 21, 2012 at 03:00:13PM +0900, Namjae Jeon wrote:
>> 2012/8/20, Fengguang Wu <fengguang.wu@intel.com>:
>> > On Mon, Aug 20, 2012 at 09:48:42AM +0900, Namjae Jeon wrote:
>> >> 2012/8/19, Fengguang Wu <fengguang.wu@intel.com>:
>> >> > On Sat, Aug 18, 2012 at 05:50:02AM -0400, Namjae Jeon wrote:
>> >> >> From: Namjae Jeon <namjae.jeon@samsung.com>
>> >> >>
>> >> >> This patch is based on suggestion by Wu Fengguang:
>> >> >> https://lkml.org/lkml/2011/8/19/19
>> >> >>
>> >> >> kernel has mechanism to do writeback as per dirty_ratio and
>> >> >> dirty_background
>> >> >> ratio. It also maintains per task dirty rate limit to keep balance
>> >> >> of
>> >> >> dirty pages at any given instance by doing bdi bandwidth
>> >> >> estimation.
>> >> >>
>> >> >> Kernel also has max_ratio/min_ratio tunables to specify percentage
>> >> >> of
>> >> >> writecache
>> >> >> to control per bdi dirty limits and task throtelling.
>> >> >>
>> >> >> However, there might be a usecase where user wants a writeback
>> >> >> tuning
>> >> >> parameter to flush dirty data at desired/tuned time interval.
>> >> >>
>> >> >> dirty_background_time provides an interface where user can tune
>> >> >> background
>> >> >> writeback start time using /sys/block/sda/bdi/dirty_background_time
>> >> >>
>> >> >> dirty_background_time is used alongwith average bdi write bandwidth
>> >> >> estimation
>> >> >> to start background writeback.
>> >> >
>> >> > Here lies my major concern about dirty_background_time: the write
>> >> > bandwidth estimation is an _estimation_ and will sure become wildly
>> >> > wrong in some cases. So the dirty_background_time implementation
>> >> > based
>> >> > on it will not always work to the user expectations.
>> >> >
>> >> > One important case is, some users (eg. Dave Chinner) explicitly take
>> >> > advantage of the existing behavior to quickly create & delete a big
>> >> > 1GB temp file without worrying about triggering unnecessary IOs.
>> >> >
>> >> Hi. Wu.
>> >> Okay, I have a question.
>> >>
>> >> If we make dirty_writeback_interval per-bdi and tune a short
>> >> interval instead of background_time, we can get a similar
>> >> performance improvement.
>> >> /sys/block/<device>/bdi/dirty_writeback_interval
>> >> /sys/block/<device>/bdi/dirty_expire_interval
>> >>
>> >> NFS write performance improvement is just one use case.
>> >>
>> >> If we can set the interval/time per bdi, other use cases can be
>> >> built on top of it.
>> >
>> > Per-bdi interval/time tunables, if such a need arises, will in
>> > essence be for data caching and safety. If we turn them into a
>> > requirement for better performance, users will potentially be
>> > stretched over choosing the "right" value to balance data caching,
>> > safety and performance.  Hmm, not a comfortable prospect.
>> Hi Wu.
>> First, thanks for the shared information.
>>
>> I changed the writeback interval on the NFS server only.
>
> OK..sorry for missing that part!
>
>> I think this does not change data/page caching behaviour on the NFS
>> client. The NFS client will start sending write requests as per the
>> default NFS/writeback logic, so there is no change in NFS client data
>> caching behaviour.
>>
>> Also, on the NFS server it does not change system-wide caching
>> behaviour. It only modifies the caching/writeback behaviour of a
>> particular “bdi” on the NFS server, so that the NFS client can see
>> better WRITE speed.
>
> But would you default to dirty_background_time=0, where the special
> value 0 means no change from the original behavior? That will address
> David's very reasonable concern. Otherwise quite a few users are going
> to be surprised by the new behavior after upgrading the kernel.
Hi. Wu.
Okay, I will resend the v2 patch with your comment incorporated
(dirty_background_time=0 by default).
Thanks a lot.

>
>> I will share several performance test results, as per Dave's suggestion.
>>
>> >
>> >> >The numbers are impressive! FYI, I tried another NFS specific
>> >> > approach
>> >> >to avoid big NFS COMMITs, which achieved similar performance gains:
>> >>
>> >> >nfs: writeback pages wait queue
>> >> >https://lkml.org/lkml/2011/10/20/235
>> This patch looks like a client-side optimization to me (I need to check more).
>
> Yes.
>
>> Do we also need the server-side optimization, as per Bruce's opinion?
>
> Sure.
>
> Thanks,
> Fengguang
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-08-22  1:10 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-18  9:50 [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable Namjae Jeon
2012-08-19  2:57 ` [PATCH 3/3] writeback: add dirty_ratio_time per bdi variable (NFS write performance) Fengguang Wu
2012-08-20  0:48   ` Namjae Jeon
2012-08-20 14:50     ` Fengguang Wu
2012-08-21  6:00       ` Namjae Jeon
2012-08-21 13:04         ` Fengguang Wu
2012-08-22  1:10           ` Namjae Jeon
2012-08-20  2:00   ` Dave Chinner
2012-08-20 18:01     ` J. Bruce Fields
2012-08-21  5:48       ` Namjae Jeon
2012-08-21 12:57         ` Fengguang Wu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).