[PATCH v4 0/3] btrfs: read_policy types latency, device and round-robin

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 0/3] btrfs: read_policy types latency, device and round-robin
@ 2021-01-20  7:52 Anand Jain
  2021-01-20  7:52 ` [PATCH v4 1/3] btrfs: add read_policy latency Anand Jain
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Anand Jain @ 2021-01-20  7:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, josef

v4:
Add rb from Josef in patch 1 and 3.
In patch 1/3, use fs_info instead of device->fs_devices->fs_info.
Drop round-robin policy because my workload (fio random) shows no performance
 gains due to fewer merges at the block layer.

v3:
The block layer commit 0d02129e76ed (block: merge struct block_device and
struct hd_struct) has changed the first argument in the function
part_stat_read_all() in 5.11-rc1. So trickle down its changes in the patch 1/4.

v2:
Fixes as per review comments, as in the individual patches.

rfc->v1:
Drop the tracing patch.
Drop the factor associated with the inflight commands (because there
were too many unnecessary switches).
Few C styles fix.

-----

This patchset adds read policy types latency, device, and round-robin, for the
mirrored raid profiles such as raid1, raid1c3, raid1c4, and raid10. The default
read policy remains as PID, as of now.

Read policy types:
Latency:

Latency policy routes the read IO based on the historical average
wait time experienced by the read IOs on the individual device.

Device:

With the device policy along with the read_preferred flag, you can
set the device for reading manually. Useful to test mirrors in a
deterministic way and helps advance system administrations.

Round-robin (RFC patch):

Alternates striped device in a round-robin loop for reading. To achieve
this first we put the stripes in an array, sort it by devid and pick the
next device.

Test scripts:
=============

I have included a few scripts which were useful for testing.

-------------------8<--------------------------------
Set latency policy on the btrfs mounted at /mnt

Usage example:
  $ readpolicyset /mnt latency

Anand Jain (3):
  btrfs: add read_policy latency
  btrfs: introduce new device-state read_preferred
  btrfs: introduce new read_policy device

 fs/btrfs/sysfs.c   | 57 ++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  5 ++++
 3 files changed, 121 insertions(+), 1 deletion(-)

-- 
2.28.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v4 1/3] btrfs: add read_policy latency
  2021-01-20  7:52 [PATCH v4 0/3] btrfs: read_policy types latency, device and round-robin Anand Jain
@ 2021-01-20  7:52 ` Anand Jain
  2021-01-20 12:14   ` David Sterba
  2021-01-20  7:52 ` [PATCH v4 2/3] btrfs: introduce new device-state read_preferred Anand Jain
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-01-20  7:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, josef

The read policy type latency routes the read IO based on the historical
average wait-time experienced by the read IOs through the individual
device. This patch obtains the historical read IO stats from the kernel
block layer and calculates its average.

Example usage:
 echo "latency" > /sys/fs/btrfs/$uuid/read_policy

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
v4: For btrfs_debug_rl() use fs_info instead of
    device->fs_devices->fs_info.

v3: The block layer commit 0d02129e76ed (block: merge struct block_device and
    struct hd_struct) has changed the first argument in the function
    part_stat_read_all() in 5.11-rc1. So the compilation will fail. This patch
    fixes it.
    Commit log updated.

v2: Use btrfs_debug_rl() instead of btrfs_info_rl()
    It is better we have this debug until we test this on at least few
    hardwares.
    Drop the unrelated changes.
    Update change log.

rfc->v1: Drop part_stat_read_all instead use part_stat_read
    Drop inflight

 fs/btrfs/sysfs.c   |  3 ++-
 fs/btrfs/volumes.c | 38 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 4522a1c4cd08..7c0324fe97b2 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -915,7 +915,8 @@ static bool strmatch(const char *buffer, const char *string)
 	return false;
 }
 
-static const char * const btrfs_read_policy_name[] = { "pid" };
+/* Must follow the order as in enum btrfs_read_policy */
+static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
 
 static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 				      struct kobj_attribute *a, char *buf)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 62d6a890fc50..f361f1c87eb6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -14,6 +14,7 @@
 #include <linux/semaphore.h>
 #include <linux/uuid.h>
 #include <linux/list_sort.h>
+#include <linux/part_stat.h>
 #include "misc.h"
 #include "ctree.h"
 #include "extent_map.h"
@@ -5490,6 +5491,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
+				  struct map_lookup *map, int first,
+				  int num_stripe)
+{
+	u64 est_wait = 0;
+	int best_stripe = 0;
+	int index;
+
+	for (index = first; index < first + num_stripe; index++) {
+		u64 read_wait;
+		u64 avg_wait = 0;
+		unsigned long read_ios;
+		struct btrfs_device *device = map->stripes[index].dev;
+
+		read_wait = part_stat_read(device->bdev, nsecs[READ]);
+		read_ios = part_stat_read(device->bdev, ios[READ]);
+
+		if (read_wait && read_ios && read_wait >= read_ios)
+			avg_wait = div_u64(read_wait, read_ios);
+		else
+			btrfs_debug_rl(fs_info,
+			"devid: %llu avg_wait ZERO read_wait %llu read_ios %lu",
+				       device->devid, read_wait, read_ios);
+
+		if (est_wait == 0 || est_wait > avg_wait) {
+			est_wait = avg_wait;
+			best_stripe = index;
+		}
+	}
+
+	return best_stripe;
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct map_lookup *map, int first,
 			    int dev_replace_is_ongoing)
@@ -5519,6 +5553,10 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	case BTRFS_READ_POLICY_PID:
 		preferred_mirror = first + (current->pid % num_stripes);
 		break;
+	case BTRFS_READ_POLICY_LATENCY:
+		preferred_mirror = btrfs_find_best_stripe(fs_info, map, first,
+							  num_stripes);
+		break;
 	}
 
 	if (dev_replace_is_ongoing &&
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1997a4649a66..71ba1f0e93f4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -222,6 +222,8 @@ enum btrfs_chunk_allocation_policy {
 enum btrfs_read_policy {
 	/* Use process PID to choose the stripe */
 	BTRFS_READ_POLICY_PID,
+	/* Find and use device with the lowest latency */
+	BTRFS_READ_POLICY_LATENCY,
 	BTRFS_NR_READ_POLICY,
 };
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/3] btrfs: add read_policy latency
  2021-01-20  7:52 ` [PATCH v4 1/3] btrfs: add read_policy latency Anand Jain
@ 2021-01-20 12:14   ` David Sterba
  2021-01-21 10:10     ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2021-01-20 12:14 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, dsterba, josef

On Tue, Jan 19, 2021 at 11:52:05PM -0800, Anand Jain wrote:
> The read policy type latency routes the read IO based on the historical
> average wait-time experienced by the read IOs through the individual
> device. This patch obtains the historical read IO stats from the kernel
> block layer and calculates its average.

This does not say how the stripe is selected using the gathered numbers.
Ie. what is the criteria like minimum average time, "based on" is too
vague.

> Example usage:
>  echo "latency" > /sys/fs/btrfs/$uuid/read_policy

Do you have some sample results? I remember you posted something but it
would be good to have that in the changelog too.

> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---
> v4: For btrfs_debug_rl() use fs_info instead of
>     device->fs_devices->fs_info.
> 
> v3: The block layer commit 0d02129e76ed (block: merge struct block_device and
>     struct hd_struct) has changed the first argument in the function
>     part_stat_read_all() in 5.11-rc1. So the compilation will fail. This patch
>     fixes it.
>     Commit log updated.
> 
> v2: Use btrfs_debug_rl() instead of btrfs_info_rl()
>     It is better we have this debug until we test this on at least few
>     hardwares.
>     Drop the unrelated changes.
>     Update change log.
> 
> rfc->v1: Drop part_stat_read_all instead use part_stat_read
>     Drop inflight
> 
>  fs/btrfs/sysfs.c   |  3 ++-
>  fs/btrfs/volumes.c | 38 ++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h |  2 ++
>  3 files changed, 42 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 4522a1c4cd08..7c0324fe97b2 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -915,7 +915,8 @@ static bool strmatch(const char *buffer, const char *string)
>  	return false;
>  }
>  
> -static const char * const btrfs_read_policy_name[] = { "pid" };
> +/* Must follow the order as in enum btrfs_read_policy */
> +static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
>  
>  static ssize_t btrfs_read_policy_show(struct kobject *kobj,
>  				      struct kobj_attribute *a, char *buf)
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 62d6a890fc50..f361f1c87eb6 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -14,6 +14,7 @@
>  #include <linux/semaphore.h>
>  #include <linux/uuid.h>
>  #include <linux/list_sort.h>
> +#include <linux/part_stat.h>
>  #include "misc.h"
>  #include "ctree.h"
>  #include "extent_map.h"
> @@ -5490,6 +5491,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>  	return ret;
>  }
>  
> +static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,

The name btrfs_find_best_stripe should be more descriptive about the
selection criteria.

> +				  struct map_lookup *map, int first,
> +				  int num_stripe)
> +{
> +	u64 est_wait = 0;
> +	int best_stripe = 0;
> +	int index;
> +
> +	for (index = first; index < first + num_stripe; index++) {
> +		u64 read_wait;
> +		u64 avg_wait = 0;
> +		unsigned long read_ios;
> +		struct btrfs_device *device = map->stripes[index].dev;
> +
> +		read_wait = part_stat_read(device->bdev, nsecs[READ]);

This should use STAT_READ as this is supposed to be indexing the stats
members. READ is some generic constant with the same value.

> +		read_ios = part_stat_read(device->bdev, ios[READ]);
> +
> +		if (read_wait && read_ios && read_wait >= read_ios)
> +			avg_wait = div_u64(read_wait, read_ios);
> +		else
> +			btrfs_debug_rl(fs_info,
> +			"devid: %llu avg_wait ZERO read_wait %llu read_ios %lu",
> +				       device->devid, read_wait, read_ios);
> +
> +		if (est_wait == 0 || est_wait > avg_wait) {
> +			est_wait = avg_wait;
> +			best_stripe = index;
> +		}
> +	}
> +
> +	return best_stripe;
> +}
> +
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
>  			    struct map_lookup *map, int first,
>  			    int dev_replace_is_ongoing)
> @@ -5519,6 +5553,10 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>  	case BTRFS_READ_POLICY_PID:
>  		preferred_mirror = first + (current->pid % num_stripes);
>  		break;
> +	case BTRFS_READ_POLICY_LATENCY:
> +		preferred_mirror = btrfs_find_best_stripe(fs_info, map, first,
> +							  num_stripes);
> +		break;
>  	}
>  
>  	if (dev_replace_is_ongoing &&
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 1997a4649a66..71ba1f0e93f4 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -222,6 +222,8 @@ enum btrfs_chunk_allocation_policy {
>  enum btrfs_read_policy {
>  	/* Use process PID to choose the stripe */
>  	BTRFS_READ_POLICY_PID,
> +	/* Find and use device with the lowest latency */
> +	BTRFS_READ_POLICY_LATENCY,
>  	BTRFS_NR_READ_POLICY,
>  };
>  
> -- 
> 2.28.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/3] btrfs: add read_policy latency
  2021-01-20 12:14   ` David Sterba
@ 2021-01-21 10:10     ` Anand Jain
  2021-01-21 17:52       ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-01-21 10:10 UTC (permalink / raw)
  To: dsterba, linux-btrfs, dsterba, josef



On 20/1/21 8:14 pm, David Sterba wrote:
> On Tue, Jan 19, 2021 at 11:52:05PM -0800, Anand Jain wrote:
>> The read policy type latency routes the read IO based on the historical
>> average wait-time experienced by the read IOs through the individual
>> device. This patch obtains the historical read IO stats from the kernel
>> block layer and calculates its average.
> 
> This does not say how the stripe is selected using the gathered numbers.
> Ie. what is the criteria like minimum average time, "based on" is too
> vague.
> 


Could you please add the following in the change log. Hope this will 
suffice.

----------
This patch adds new read policy Latency. This policy routes the read
I/Os based on the device's average wait time for read requests.
The average is calculated by dividing the total wait time for read
requests by the total read I/Os processed by the device.
This policy uses kernel disk stat to calculate the average, so it needs
the kernel stat to be enabled. If in case the kernel stat is disabled
the policy uses the stripe 0.
This policy can be set through the read_policy sysfs interface as shown
below.

     $ echo latency > /sys/fs/btrfs/<uuid>/read_policy
     $ cat /sys/fs/btrfs/<uuid>/read_policy
          pid [latency] device roundrobin

This policy won't persist across reboot or mount unmount recycle as of
now.

Here below are few performance test results with latency compared with 
pid policy.

raid1 fio read 500m
-----------------------------------------------------
dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
read type   | random    sequential random    sequential
------------+------------------------------------------
pid         | 744MiB/s  809MiB/s  2225MiB/s 2155MiB/s
latency     | 2072MiB/s 2008MiB/s  1999MiB/s 1961MiB/s


raid10 fio read 500m
-----------------------------------------------------
dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
read type   | random    sequential random    sequential
------------+------------------------------------------
pid         | 1282MiB/s 1427MiB/s 2152MiB/s 1969MiB/s
latency     | 2073MiB/s 1871MiB/s 1975MiB/s 1984MiB/s


raid1c3 fio read 500m
-----------------------------------------------------
dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
read type   | random    sequential random    sequential
------------+------------------------------------------
pid         |  973MiB/s  955MiB/s 2144MiB/s 1962MiB/s
latency     | 2005MiB/s 1924MiB/s 2083MiB/s 1980MiB/s


raid1c4 fio read 500m
-----------------------------------------------------
dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
read type   | random    sequential random    sequential
------------+------------------------------------------
pid         | 1204MiB/s 1221MiB/s 2065MiB/s 1878MiB/s
latency     | 1990MiB/s 1920MiB/s 1945MiB/s 1865MiB/s


In the given fio I/O workload above, it is found that there are fewer 
I/O merges in case of latency as compared to pid. So in the case of all 
homogeneous devices pid performance little better.
The latency is a better choice in the case of mixed types of devices. 
Also if any one of the devices is under performing due to intermittent 
I/Os retries, then the latency policy will automatically use the best 
available.
-----------




>> Example usage:
>>   echo "latency" > /sys/fs/btrfs/$uuid/read_policy
> 
> Do you have some sample results? I remember you posted something but it
> would be good to have that in the changelog too.

Thanks for suggesting now I have included it, as above.
Also, I can generate a patch reroll with this change log if needed.
Please, let me know.

Thanks, Anand


> 
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>> v4: For btrfs_debug_rl() use fs_info instead of
>>      device->fs_devices->fs_info.
>>
>> v3: The block layer commit 0d02129e76ed (block: merge struct block_device and
>>      struct hd_struct) has changed the first argument in the function
>>      part_stat_read_all() in 5.11-rc1. So the compilation will fail. This patch
>>      fixes it.
>>      Commit log updated.
>>
>> v2: Use btrfs_debug_rl() instead of btrfs_info_rl()
>>      It is better we have this debug until we test this on at least few
>>      hardwares.
>>      Drop the unrelated changes.
>>      Update change log.
>>
>> rfc->v1: Drop part_stat_read_all instead use part_stat_read
>>      Drop inflight
>>
>>   fs/btrfs/sysfs.c   |  3 ++-
>>   fs/btrfs/volumes.c | 38 ++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/volumes.h |  2 ++
>>   3 files changed, 42 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
>> index 4522a1c4cd08..7c0324fe97b2 100644
>> --- a/fs/btrfs/sysfs.c
>> +++ b/fs/btrfs/sysfs.c
>> @@ -915,7 +915,8 @@ static bool strmatch(const char *buffer, const char *string)
>>   	return false;
>>   }
>>   
>> -static const char * const btrfs_read_policy_name[] = { "pid" };
>> +/* Must follow the order as in enum btrfs_read_policy */
>> +static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
>>   
>>   static ssize_t btrfs_read_policy_show(struct kobject *kobj,
>>   				      struct kobj_attribute *a, char *buf)
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 62d6a890fc50..f361f1c87eb6 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -14,6 +14,7 @@
>>   #include <linux/semaphore.h>
>>   #include <linux/uuid.h>
>>   #include <linux/list_sort.h>
>> +#include <linux/part_stat.h>
>>   #include "misc.h"
>>   #include "ctree.h"
>>   #include "extent_map.h"
>> @@ -5490,6 +5491,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>>   	return ret;
>>   }
>>   
>> +static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
> 
> The name btrfs_find_best_stripe should be more descriptive about the
> selection criteria.
> 
>> +				  struct map_lookup *map, int first,
>> +				  int num_stripe)
>> +{
>> +	u64 est_wait = 0;
>> +	int best_stripe = 0;
>> +	int index;
>> +
>> +	for (index = first; index < first + num_stripe; index++) {
>> +		u64 read_wait;
>> +		u64 avg_wait = 0;
>> +		unsigned long read_ios;
>> +		struct btrfs_device *device = map->stripes[index].dev;
>> +
>> +		read_wait = part_stat_read(device->bdev, nsecs[READ]);
> 
> This should use STAT_READ as this is supposed to be indexing the stats
> members. READ is some generic constant with the same value.
> 
>> +		read_ios = part_stat_read(device->bdev, ios[READ]);
>> +
>> +		if (read_wait && read_ios && read_wait >= read_ios)
>> +			avg_wait = div_u64(read_wait, read_ios);
>> +		else
>> +			btrfs_debug_rl(fs_info,
>> +			"devid: %llu avg_wait ZERO read_wait %llu read_ios %lu",
>> +				       device->devid, read_wait, read_ios);
>> +
>> +		if (est_wait == 0 || est_wait > avg_wait) {
>> +			est_wait = avg_wait;
>> +			best_stripe = index;
>> +		}
>> +	}
>> +
>> +	return best_stripe;
>> +}
>> +
>>   static int find_live_mirror(struct btrfs_fs_info *fs_info,
>>   			    struct map_lookup *map, int first,
>>   			    int dev_replace_is_ongoing)
>> @@ -5519,6 +5553,10 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>>   	case BTRFS_READ_POLICY_PID:
>>   		preferred_mirror = first + (current->pid % num_stripes);
>>   		break;
>> +	case BTRFS_READ_POLICY_LATENCY:
>> +		preferred_mirror = btrfs_find_best_stripe(fs_info, map, first,
>> +							  num_stripes);
>> +		break;
>>   	}
>>   
>>   	if (dev_replace_is_ongoing &&
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index 1997a4649a66..71ba1f0e93f4 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -222,6 +222,8 @@ enum btrfs_chunk_allocation_policy {
>>   enum btrfs_read_policy {
>>   	/* Use process PID to choose the stripe */
>>   	BTRFS_READ_POLICY_PID,
>> +	/* Find and use device with the lowest latency */
>> +	BTRFS_READ_POLICY_LATENCY,
>>   	BTRFS_NR_READ_POLICY,
>>   };
>>   
>> -- 
>> 2.28.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/3] btrfs: add read_policy latency
  2021-01-21 10:10     ` Anand Jain
@ 2021-01-21 17:52       ` David Sterba
  2021-01-22  8:10         ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2021-01-21 17:52 UTC (permalink / raw)
  To: Anand Jain; +Cc: dsterba, linux-btrfs, dsterba, josef

On Thu, Jan 21, 2021 at 06:10:36PM +0800, Anand Jain wrote:
> 
> 
> On 20/1/21 8:14 pm, David Sterba wrote:
> > On Tue, Jan 19, 2021 at 11:52:05PM -0800, Anand Jain wrote:
> >> The read policy type latency routes the read IO based on the historical
> >> average wait-time experienced by the read IOs through the individual
> >> device. This patch obtains the historical read IO stats from the kernel
> >> block layer and calculates its average.
> > 
> > This does not say how the stripe is selected using the gathered numbers.
> > Ie. what is the criteria like minimum average time, "based on" is too
> > vague.
> > 
> 
> 
> Could you please add the following in the change log. Hope this will 
> suffice.
> 
> ----------
> This patch adds new read policy Latency. This policy routes the read
> I/Os based on the device's average wait time for read requests.

'wait time' means the time from io submission to completion

> The average is calculated by dividing the total wait time for read
> requests by the total read I/Os processed by the device.

So this is based on numbers from the entire lifetime of the device?  The
numbers are IMHO not a reliable source. If unrelated writes increase the
read wait time then the device will not be selected until the average
is lower than of the other devices.

The average can only decrease after there are some fast reads, which is
not guaranted to happen and there's no good estimate how long it could
take to happen.

The tests we all probably do are on a fresh mkfs and with a small
workload but the mirror selection logic must work long term.

The part_stat numbers could be used but must reflect the time factor,
ie. it needs to be some a rolling average or collecting a sample for
last N seconds.

Bear in mind that this is only a heuristic and we don't need perfect
results nor we want to replace io scheduling, so the amont of collected
data or the logic should be straightforward.

> This policy uses kernel disk stat to calculate the average, so it needs
> the kernel stat to be enabled.

What is needed to enable it? I see it's always compiled in in
block/blk-core.c.

> If in case the kernel stat is disabled
> the policy uses the stripe 0.
> This policy can be set through the read_policy sysfs interface as shown
> below.
> 
>      $ echo latency > /sys/fs/btrfs/<uuid>/read_policy
>      $ cat /sys/fs/btrfs/<uuid>/read_policy
>           pid [latency] device roundrobin
> 
> This policy won't persist across reboot or mount unmount recycle as of
> now.
> 
> Here below are few performance test results with latency compared with 
> pid policy.
> 
> raid1 fio read 500m

500m is really small data size for such measurement

> -----------------------------------------------------
> dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
> read type   | random    sequential random    sequential
> ------------+------------------------------------------
> pid         | 744MiB/s  809MiB/s  2225MiB/s 2155MiB/s
> latency     | 2072MiB/s 2008MiB/s  1999MiB/s 1961MiB/s

Namely when the device bandwidth is 4x higher. The data size should be
scaled up so the whole run takes at least 30 seconds if not a few
minutes.

Other missing information about the load is the number of threads and if
it's buffered or direct io.

> raid10 fio read 500m
> -----------------------------------------------------
> dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
> read type   | random    sequential random    sequential
> ------------+------------------------------------------
> pid         | 1282MiB/s 1427MiB/s 2152MiB/s 1969MiB/s
> latency     | 2073MiB/s 1871MiB/s 1975MiB/s 1984MiB/s
> 
> 
> raid1c3 fio read 500m
> -----------------------------------------------------
> dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
> read type   | random    sequential random    sequential
> ------------+------------------------------------------
> pid         |  973MiB/s  955MiB/s 2144MiB/s 1962MiB/s
> latency     | 2005MiB/s 1924MiB/s 2083MiB/s 1980MiB/s
> 
> 
> raid1c4 fio read 500m
> -----------------------------------------------------
> dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
> read type   | random    sequential random    sequential
> ------------+------------------------------------------
> pid         | 1204MiB/s 1221MiB/s 2065MiB/s 1878MiB/s
> latency     | 1990MiB/s 1920MiB/s 1945MiB/s 1865MiB/s
> 
> 
> In the given fio I/O workload above, it is found that there are fewer 
> I/O merges in case of latency as compared to pid. So in the case of all 
> homogeneous devices pid performance little better.

Yeah switching the device in the middle of a contiguous range could slow
it down but as long as it's not "too much", then it's ok.

The pid selection is good for multiple threads workload but we also want
to make it work with single thread reads, like a simple 'cp'.

I tested this policy and with 2G file 'cat file' utilizes only one
device, so this is no improvement to the pid policy.

A policy based on read latency makes sense but the current
implementation does not cover enough workloads.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/3] btrfs: add read_policy latency
  2021-01-21 17:52       ` David Sterba
@ 2021-01-22  8:10         ` Anand Jain
  2021-01-30  1:08           ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-01-22  8:10 UTC (permalink / raw)
  To: dsterba, linux-btrfs, dsterba, josef; +Cc: anand.jain

On 22/1/21 1:52 am, David Sterba wrote:
> On Thu, Jan 21, 2021 at 06:10:36PM +0800, Anand Jain wrote:
>>
>>
>> On 20/1/21 8:14 pm, David Sterba wrote:
>>> On Tue, Jan 19, 2021 at 11:52:05PM -0800, Anand Jain wrote:
>>>> The read policy type latency routes the read IO based on the historical
>>>> average wait-time experienced by the read IOs through the individual
>>>> device. This patch obtains the historical read IO stats from the kernel
>>>> block layer and calculates its average.
>>>
>>> This does not say how the stripe is selected using the gathered numbers.
>>> Ie. what is the criteria like minimum average time, "based on" is too
>>> vague.
>>>
>>
>>
>> Could you please add the following in the change log. Hope this will
>> suffice.
>>
>> ----------
>> This patch adds new read policy Latency. This policy routes the read
>> I/Os based on the device's average wait time for read requests.
> 
> 'wait time' means the time from io submission to completion
> 
  Yes, at the block layer.

>> The average is calculated by dividing the total wait time for read
>> requests by the total read I/Os processed by the device.
> 
> So this is based on numbers from the entire lifetime of the device?

  No,  Kernel stats are in memory only, so it is since boot.

>  The
> numbers are IMHO not a reliable source. If unrelated writes increase the
> read wait time then the device will not be selected until the average
> is lower than of the other devices.

  I think it is fair. Because comparison is between the performance of

  1. disk-type-A   VS    disk-type-A

  OR

  2. disk-type-A    VS   disk-type-B

  In the scenario #1 above, it does not matter which disk, as both of
  them provides the same performance (theoretically), which is the most
  common config.

  In scenario 2# the user can check the read I/O on the devices, if it
  is _not_ going to the best performing device by theory, either a reboot
  or iostat-reset (which I think should be there) shall help.
  Or if they can't reboot or if iostat-reset is not available, then
  switching to the read-policy 'device' shall help until they
  reboot, which is a better alternative than PID, which is unpredictable.
  Unfortunately, this switching is not automatic (more below).

  There are drawbacks to this policy.
  At any point in time, momentarily, a device may get too busy due to 
_external factors_ such as - multiple partitions on
  the same device, multiple LUNs on the same HBA, OR if the IRQ is shared
  by the disk's HBA and the gigabit network card (which has better IRQ
  priority) so whenever the network is busy, the I/O on the disk slows
  down (I had an opportunity to investigate such an issue before).
  So now the latency policy shall switch to the better performing device
  at such a time. But if the theoretically better performing device is
  back to its normal speed, yes, unless the device gets the read I/O by
  some operation (for example, scrub), the policy won't know. This
  scenario is more crucial for the config type #2 (above).

  Also, there may be a better alternative to the average wait time (for
  example, another type of mean-values?) Which I think can be tweaked in
  the long term when we understand the usage of this policy better. If we
  account for the inflight commands, there will be more switching in
  config type #1 (above). More switching leads to fewer I/O mergers and
  higher cache misses (DMA and storage level) leading to poorer
  performance. So switching back and forth between devices is not good as
  well. So stay where they are helps until it performs worst than its
  mirrored device.

> The average can only decrease after there are some fast reads, which is
> not guaranted to happen and there's no good estimate how long it could
> take to happen.

  True. Also, there isn't any kernel part-stat reset. Not sure if the
  block layer will entertain such a patch, but worth a try IMO. What
  What do you think?

  However, even if I reset, it's not guaranteed that temporary bad stats
  can not happen again. Also it's a bit uncertain how to know when will
  the theoretically better performing device will be back to its good
  performance.

> The tests we all probably do are on a fresh mkfs and with a small
> workload but the mirror selection logic must work long term.
> 

  I totally agree. So I am not yet recommending this policy for the
  default. But ut does solve some of the problems very well.

> The part_stat numbers could be used but must reflect the time factor,
> ie. it needs to be some a rolling average or collecting a sample for
> last N seconds.

  But, I think the problem here is to know when will the
  theoretically better performing device will be back to its good
  performance. So for that purpose, the theoretically better performing
  device must be probed periodically. And there will be cost.

> 
> Bear in mind that this is only a heuristic and we don't need perfect
> results nor we want to replace io scheduling, so the amont of collected
> data or the logic should be straightforward.
> 
  Yeah. If part_stat can provide stat only for past N-mins or so, it will
  be simpler. During this patch, I looked into the part_stat code it is
  not straightforward.

>> This policy uses kernel disk stat to calculate the average, so it needs
>> the kernel stat to be enabled.
> 
> What is needed to enable it? I see it's always compiled in in
> block/blk-core.c.
> 

  It is enabled by default. But the user may disable part_stat
  collection at the run time.

    echo 0 > /sys/block/sdx/queue/iostat

>> If in case the kernel stat is disabled
>> the policy uses the stripe 0.
>> This policy can be set through the read_policy sysfs interface as shown
>> below.
>>
>>       $ echo latency > /sys/fs/btrfs/<uuid>/read_policy
>>       $ cat /sys/fs/btrfs/<uuid>/read_policy
>>            pid [latency] device roundrobin
>>
>> This policy won't persist across reboot or mount unmount recycle as of
>> now.
>>
>> Here below are few performance test results with latency compared with
>> pid policy.
>>
>> raid1 fio read 500m
> 
> 500m is really small data size for such measurement
> 

  Pls see below about this.

>> -----------------------------------------------------
>> dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
>> read type   | random    sequential random    sequential
>> ------------+------------------------------------------
>> pid         | 744MiB/s  809MiB/s  2225MiB/s 2155MiB/s
>> latency     | 2072MiB/s 2008MiB/s  1999MiB/s 1961MiB/s
> 
> Namely when the device bandwidth is 4x higher. The data size should be
> scaled up so the whole run takes at least 30 seconds if not a few
> minutes.
> > Other missing information about the load is the number of threads and if
> it's buffered or direct io.
> 

  The cover letter has the fio command used. The output from the guest VM
  is there. From it, I notice the I/Os performed were ~16.8G. I can run
  the scripts again. Pls, do share with me if you have any ideas for
  testing.

  READ: bw=87.0MiB/s (91.2MB/s), 87.0MiB/s-87.0MiB/s 
(91.2MB/s-91.2MB/s), io=15.6GiB (16.8GB), run=183884-183884msec

>> raid10 fio read 500m
>> -----------------------------------------------------
>> dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
>> read type   | random    sequential random    sequential
>> ------------+------------------------------------------
>> pid         | 1282MiB/s 1427MiB/s 2152MiB/s 1969MiB/s
>> latency     | 2073MiB/s 1871MiB/s 1975MiB/s 1984MiB/s
>>
>>
>> raid1c3 fio read 500m
>> -----------------------------------------------------
>> dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
>> read type   | random    sequential random    sequential
>> ------------+------------------------------------------
>> pid         |  973MiB/s  955MiB/s 2144MiB/s 1962MiB/s
>> latency     | 2005MiB/s 1924MiB/s 2083MiB/s 1980MiB/s
>>
>>
>> raid1c4 fio read 500m
>> -----------------------------------------------------
>> dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
>> read type   | random    sequential random    sequential
>> ------------+------------------------------------------
>> pid         | 1204MiB/s 1221MiB/s 2065MiB/s 1878MiB/s
>> latency     | 1990MiB/s 1920MiB/s 1945MiB/s 1865MiB/s
>>
>>
>> In the given fio I/O workload above, it is found that there are fewer
>> I/O merges in case of latency as compared to pid. So in the case of all
>> homogeneous devices pid performance little better.
> 
> Yeah switching the device in the middle of a contiguous range could slow
> it down but as long as it's not "too much", then it's ok.
> 

  Yep.

> The pid selection is good for multiple threads workload but we also want
> to make it work with single thread reads, like a simple 'cp'.
> 
> I tested this policy and with 2G file 'cat file' utilizes only one
> device, so this is no improvement to the pid policy.
> 
  In the 'cat file' test case above, all the read IOs will go to a
  single stripe id. But, it does not mean that it will go to the same
  device. As of now, our chunk allocation is based on the device's free
  size. So the better thing to do is to have raid 1 on disks of
  different sizes like, for example, 50G and 100G. Then it guarantees
  that stripe 0 will be always on the 100G disk. Then it is fair to
  measure the pid policy.

  And still, pid policy may perform better, as reading from a single disk
  is not a bad idea. The read_policy type 'device' proved it.

  All the policy depends on the workload, so is pid policy. But on top
  of it the pid policy is non-deterministic which makes it hard to say
  how it shall be in a known workload.

> A policy based on read latency makes sense but the current
> implementation does not cover enough workloads.
> 

  Yeah. The performances of any policy here (including PID and round-
  robin) are workload-dependent. IMHO it can't be like one-size-fits
  and meant to be tuned.

Thanks, Anand

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/3] btrfs: add read_policy latency
  2021-01-22  8:10         ` Anand Jain
@ 2021-01-30  1:08           ` Anand Jain
  2021-02-04 12:30             ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-01-30  1:08 UTC (permalink / raw)
  To: dsterba, linux-btrfs, dsterba, josef

>> 500m is really small data size for such measurement

I reran the read policy tests with some changes in the fio command
options. Mainly to measure IOPS throughput and latency on the filesystem
with latency-policy and pid-policy.

Each of these tests was run for 3 iterations and the best and worst of
those 3 iterations are shown below.

These workloads are performing read-write which is the most commonly
used workload, on a single type of device (which is nvme here) and two
devices are configured for RAID1.

In all these read-write workloads, pid-policy performed ~25% better than
the latency-policy for both throughput and IOPS, and 3% better on the
latency parameter.

I haven't analyzed these read-write workloads on RAID1c3/RAID1c4 yet,
but RAID1 is more common than other types, IMO.

So I think pid-policy should remain as our default read policy.

However as shown before, pid-policy perform worst in the case of special
configs such as volumes with mixed types of devices. For those special
mixed types of devices, latency-policy performs better than pid-policy.
As tested before typically latency-policy provided ~175% better
throughput performance in the case of mixed types of devices (SSD and
nvme).

Feedbacks welcome.

Fio logs below.

IOPS focused readwrite workload:
fio --filename=/btrfs/foo --size=500GB --direct=1 --rw=randrw --bs=4k 
--ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based 
--group_reporting --name=iops-randomreadwrite --eta-newline=1

pid [latency] device roundrobin ( 00)
   read: IOPS=40.6k, BW=159MiB/s (166MB/s)(18.6GiB/120002msec)

[pid] latency device roundrobin ( 00)
   read: IOPS=50.7k, BW=198MiB/s (208MB/s)(23.2GiB/120001msec)

IOPS is 25% better with pid policy.

Throughput focused readwrite workload:
fio --filename=/btrfs/foo --size=500GB --direct=1 --rw=randrw --bs=64k 
--ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based 
--group_reporting --name=throughput-randomreadwrite --eta-newline=1

pid [latency] device roundrobin ( 00)
   read: IOPS=8525, BW=533MiB/s (559MB/s)(62.4GiB/120003msec)

[pid] latency device roundrobin ( 00)
   read: IOPS=10.7k, BW=670MiB/s (702MB/s)(78.5GiB/120005msec)

Throughput is 25% better with pid policy

Latency focused readwrite workload:
fio --filename=/btrfs/foo --size=500GB --direct=1 --rw=randrw --bs=4k 
--ioengine=libaio --iodepth=1 --runtime=120 --numjobs=4 --time_based 
--group_reporting --name=latency-randomreadwrite --eta-newline=1
pid [latency] device roundrobin ( 00)
   read: IOPS=59.8k, BW=234MiB/s (245MB/s)(27.4GiB/120003msec)
      lat (usec): min=68, max=826930, avg=1917.20, stdev=4210.90

[pid] latency device roundrobin ( 00)
   read: IOPS=61.9k, BW=242MiB/s (253MB/s)(28.3GiB/120001msec)
      lat (usec): min=64, max=751557, avg=1846.07, stdev=4082.97

Latency is 3% better with pid policy.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/3] btrfs: add read_policy latency
  2021-01-30  1:08           ` Anand Jain
@ 2021-02-04 12:30             ` Anand Jain
  2021-02-09 21:12               ` Michal Rostecki
  0 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-02-04 12:30 UTC (permalink / raw)
  To: dsterba, linux-btrfs, dsterba, josef, mrostecki


Hi Michal,

  Did you get any chance to run the evaluation with this patchset?

Thanks, Anand

On 1/30/2021 9:08 AM, Anand Jain wrote:
> 
>>> 500m is really small data size for such measurement
> 
> I reran the read policy tests with some changes in the fio command
> options. Mainly to measure IOPS throughput and latency on the filesystem
> with latency-policy and pid-policy.
> 
> Each of these tests was run for 3 iterations and the best and worst of
> those 3 iterations are shown below.
> 
> These workloads are performing read-write which is the most commonly
> used workload, on a single type of device (which is nvme here) and two
> devices are configured for RAID1.
> 
> In all these read-write workloads, pid-policy performed ~25% better than
> the latency-policy for both throughput and IOPS, and 3% better on the
> latency parameter.
> 
> I haven't analyzed these read-write workloads on RAID1c3/RAID1c4 yet,
> but RAID1 is more common than other types, IMO.
> 
> So I think pid-policy should remain as our default read policy.
> 
> However as shown before, pid-policy perform worst in the case of special
> configs such as volumes with mixed types of devices. For those special
> mixed types of devices, latency-policy performs better than pid-policy.
> As tested before typically latency-policy provided ~175% better
> throughput performance in the case of mixed types of devices (SSD and
> nvme).
> 
> Feedbacks welcome.
> 
> Fio logs below.
> 
> 
> IOPS focused readwrite workload:
> fio --filename=/btrfs/foo --size=500GB --direct=1 --rw=randrw --bs=4k 
> --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based 
> --group_reporting --name=iops-randomreadwrite --eta-newline=1
> 
> pid [latency] device roundrobin ( 00)
>    read: IOPS=40.6k, BW=159MiB/s (166MB/s)(18.6GiB/120002msec)
> 
> [pid] latency device roundrobin ( 00)
>    read: IOPS=50.7k, BW=198MiB/s (208MB/s)(23.2GiB/120001msec)
> 
> IOPS is 25% better with pid policy.
> 
> 
> Throughput focused readwrite workload:
> fio --filename=/btrfs/foo --size=500GB --direct=1 --rw=randrw --bs=64k 
> --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based 
> --group_reporting --name=throughput-randomreadwrite --eta-newline=1
> 
> pid [latency] device roundrobin ( 00)
>    read: IOPS=8525, BW=533MiB/s (559MB/s)(62.4GiB/120003msec)
> 
> [pid] latency device roundrobin ( 00)
>    read: IOPS=10.7k, BW=670MiB/s (702MB/s)(78.5GiB/120005msec)
> 
> Throughput is 25% better with pid policy
> 
> Latency focused readwrite workload:
> fio --filename=/btrfs/foo --size=500GB --direct=1 --rw=randrw --bs=4k 
> --ioengine=libaio --iodepth=1 --runtime=120 --numjobs=4 --time_based 
> --group_reporting --name=latency-randomreadwrite --eta-newline=1
> pid [latency] device roundrobin ( 00)
>    read: IOPS=59.8k, BW=234MiB/s (245MB/s)(27.4GiB/120003msec)
>       lat (usec): min=68, max=826930, avg=1917.20, stdev=4210.90
> 
> [pid] latency device roundrobin ( 00)
>    read: IOPS=61.9k, BW=242MiB/s (253MB/s)(28.3GiB/120001msec)
>       lat (usec): min=64, max=751557, avg=1846.07, stdev=4082.97
> 
> Latency is 3% better with pid policy.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/3] btrfs: add read_policy latency
  2021-02-04 12:30             ` Anand Jain
@ 2021-02-09 21:12               ` Michal Rostecki
  2021-02-10  6:14                 ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Rostecki @ 2021-02-09 21:12 UTC (permalink / raw)
  To: Anand Jain; +Cc: dsterba, linux-btrfs, dsterba, josef

On Thu, Feb 04, 2021 at 08:30:01PM +0800, Anand Jain wrote:
> 
> Hi Michal,
> 
>  Did you get any chance to run the evaluation with this patchset?
> 
> Thanks, Anand
> 

Hi Anand,

Yes, I tested your policies now. Sorry for late response.

For the singlethreaded test:

  [global]
  name=btrfs-raid1-seqread
  filename=btrfs-raid1-seqread
  rw=read
  bs=64k
  direct=0
  numjobs=1
  time_based=0

  [file1]
  size=10G
  ioengine=libaio

results are:

- raid1c3 with 3 HDDs:
  3 x Segate Barracuda ST2000DM008 (2TB)
  * pid policy
    READ: bw=215MiB/s (226MB/s), 215MiB/s-215MiB/s (226MB/s-226MB/s),
    io=10.0GiB (10.7GB), run=47537-47537msec
  * latency policy
    READ: bw=219MiB/s (229MB/s), 219MiB/s-219MiB/s (229MB/s-229MB/s),
    io=10.0GiB (10.7GB), run=46852-46852msec
  * device policy - didn't test it here, I guess it doesn't make sense
    to check it on non-mixed arrays ;)
- raid1c3 with 2 HDDs and 1 SSD:
  2 x Segate Barracuda ST2000DM008 (2TB)
  1 x Crucial CT256M550SSD1 (256GB)
  * pid policy
    READ: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s),
    io=10.0GiB (10.7GB), run=46749-46749msec
  * latency policy
    READ: bw=517MiB/s (542MB/s), 517MiB/s-517MiB/s (542MB/s-542MB/s),
    io=10.0GiB (10.7GB), run=19823-19823msec
  * device policy
    READ: bw=517MiB/s (542MB/s), 517MiB/s-517MiB/s (542MB/s-542MB/s),
    io=10.0GiB (10.7GB), run=19810-19810msec

For the multithreaded test:

  [global]
  name=btrfs-raid1-seqread
  filename=btrfs-raid1-seqread
  rw=read
  bs=64k
  direct=0
  numjobs=1
  time_based=0

  [file1]
  size=10G
  ioengine=libaio

results are:

- raid1c3 with 3 HDDs:
  3 x Segate Barracuda ST2000DM008 (2TB)
  * pid policy
    READ: bw=1608MiB/s (1686MB/s), 201MiB/s-201MiB/s (211MB/s-211MB/s),
    io=80.0GiB (85.9GB), run=50948-50949msec
  * latency policy
    READ: bw=1515MiB/s (1588MB/s), 189MiB/s-189MiB/s (199MB/s-199MB/s),
    io=80.0GiB (85.9GB), run=54081-54084msec
- raid1c3 with 2 HDDs and 1 SSD:
  2 x Segate Barracuda ST2000DM008 (2TB)
  1 x Crucial CT256M550SSD1 (256GB)
  * pid policy
    READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
    io=80.0GiB (85.9GB), run=44449-44450msec
  * latency policy
    READ: bw=4213MiB/s (4417MB/s), 527MiB/s-527MiB/s (552MB/s-552MB/s),
    io=80.0GiB (85.9GB), run=19444-19446msec
  * device policy
    READ: bw=4196MiB/s (4400MB/s), 525MiB/s-525MiB/s (550MB/s-550MB/s),
    io=80.0GiB (85.9GB), run=19522-19522msec

To sum it up - I think that your policies are indeed a very good match
for mixed (nonrot and rot) arrays.

They perform either slightly better or worse (depending on the test)
than pid policy on all-HDD arrays.

I've just sent out my proposal of roundrobin policy, which seems to give
better performance for all-HDD than your policies (and better than pid
policy in all cases):

https://patchwork.kernel.org/project/linux-btrfs/patch/20210209203041.21493-7-mrostecki@suse.de/

Cheers,
Michal

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/3] btrfs: add read_policy latency
  2021-02-09 21:12               ` Michal Rostecki
@ 2021-02-10  6:14                 ` Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2021-02-10  6:14 UTC (permalink / raw)
  To: Michal Rostecki; +Cc: dsterba, linux-btrfs, dsterba, josef



On 10/02/2021 05:12, Michal Rostecki wrote:
> On Thu, Feb 04, 2021 at 08:30:01PM +0800, Anand Jain wrote:
>>
>> Hi Michal,
>>
>>   Did you get any chance to run the evaluation with this patchset?
>>
>> Thanks, Anand
>>
> 
> Hi Anand,
> 
> Yes, I tested your policies now. Sorry for late response.
> 
> For the singlethreaded test:
> 
>    [global]
>    name=btrfs-raid1-seqread
>    filename=btrfs-raid1-seqread
>    rw=read
>    bs=64k
>    direct=0
>    numjobs=1
>    time_based=0
> 
>    [file1]
>    size=10G
>    ioengine=libaio
> 
> results are:
> 
> - raid1c3 with 3 HDDs:
>    3 x Segate Barracuda ST2000DM008 (2TB)
>    * pid policy
>      READ: bw=215MiB/s (226MB/s), 215MiB/s-215MiB/s (226MB/s-226MB/s),
>      io=10.0GiB (10.7GB), run=47537-47537msec
>    * latency policy
>      READ: bw=219MiB/s (229MB/s), 219MiB/s-219MiB/s (229MB/s-229MB/s),
>      io=10.0GiB (10.7GB), run=46852-46852msec


>    * device policy - didn't test it here, I guess it doesn't make sense
>      to check it on non-mixed arrays ;)


Hum. device policy provided best performance in non-mixed arrays with 
fio sequential workload.

raid1c3 Read 500m (time = 60sec)
-----------------------------------------------------
             | nvme+ssd  nvme+ssd  all-nvme  all-nvme
             | random    seq       random    seq
------------+-----------------------------------------
pid         |  973MiB/s  955MiB/s 2144MiB/s 1962MiB/s
latency     | 2005MiB/s 1924MiB/s 2083MiB/s 1980MiB/s
device(nvme)| 2021MiB/s 2034MiB/s 1920MiB/s 2132MiB/s
roundrobin  |  707MiB/s  701MiB/s 1760MiB/s 1990MiB/s



> - raid1c3 with 2 HDDs and 1 SSD:
>    2 x Segate Barracuda ST2000DM008 (2TB)
>    1 x Crucial CT256M550SSD1 (256GB)
>    * pid policy
>      READ: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s),
>      io=10.0GiB (10.7GB), run=46749-46749msec
>    * latency policy
>      READ: bw=517MiB/s (542MB/s), 517MiB/s-517MiB/s (542MB/s-542MB/s),
>      io=10.0GiB (10.7GB), run=19823-19823msec
>    * device policy
>      READ: bw=517MiB/s (542MB/s), 517MiB/s-517MiB/s (542MB/s-542MB/s),
>      io=10.0GiB (10.7GB), run=19810-19810msec
> 
> For the multithreaded test:
> 
>    [global]
>    name=btrfs-raid1-seqread
>    filename=btrfs-raid1-seqread
>    rw=read
>    bs=64k
>    direct=0
>    numjobs=1
>    time_based=0
> 
>    [file1]
>    size=10G
>    ioengine=libaio
> 
> results are:
> 
> - raid1c3 with 3 HDDs:
>    3 x Segate Barracuda ST2000DM008 (2TB)
>    * pid policy
>      READ: bw=1608MiB/s (1686MB/s), 201MiB/s-201MiB/s (211MB/s-211MB/s),
>      io=80.0GiB (85.9GB), run=50948-50949msec
>    * latency policy
>      READ: bw=1515MiB/s (1588MB/s), 189MiB/s-189MiB/s (199MB/s-199MB/s),
>      io=80.0GiB (85.9GB), run=54081-54084msec
> - raid1c3 with 2 HDDs and 1 SSD:
>    2 x Segate Barracuda ST2000DM008 (2TB)
>    1 x Crucial CT256M550SSD1 (256GB)
>    * pid policy
>      READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
>      io=80.0GiB (85.9GB), run=44449-44450msec
>    * latency policy
>      READ: bw=4213MiB/s (4417MB/s), 527MiB/s-527MiB/s (552MB/s-552MB/s),
>      io=80.0GiB (85.9GB), run=19444-19446msec
>    * device policy
>      READ: bw=4196MiB/s (4400MB/s), 525MiB/s-525MiB/s (550MB/s-550MB/s),
>      io=80.0GiB (85.9GB), run=19522-19522msec
> 
> To sum it up - I think that your policies are indeed a very good match
> for mixed (nonrot and rot) arrays.
> 
> They perform either slightly better or worse (depending on the test)
> than pid policy on all-HDD arrays.

Theoretically, latency would perform better, as the latency parameter
works as a feedback loop. Dynamically adjusting itself to the delivered
performance. But there is overhead to calculate the latency.

Thanks, Anand

> I've just sent out my proposal of roundrobin policy, which seems to give
> better performance for all-HDD than your policies (and better than pid
> policy in all cases):
> 
> https://patchwork.kernel.org/project/linux-btrfs/patch/20210209203041.21493-7-mrostecki@suse.de/
> 
> Cheers,
> Michal
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v4 2/3] btrfs: introduce new device-state read_preferred
  2021-01-20  7:52 [PATCH v4 0/3] btrfs: read_policy types latency, device and round-robin Anand Jain
  2021-01-20  7:52 ` [PATCH v4 1/3] btrfs: add read_policy latency Anand Jain
@ 2021-01-20  7:52 ` Anand Jain
  2021-01-21 10:19   ` Anand Jain
  2021-01-20  7:52 ` [PATCH v4 3/3] btrfs: introduce new read_policy device Anand Jain
  2021-01-20 12:34 ` [PATCH v4 0/3, full-cover-letter] btrfs: read_policy types latency, device and round-robin Anand Jain
  3 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-01-20  7:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, josef

This is a preparatory patch and introduces a new device flag
'read_preferred', RW-able using sysfs interface.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v4: -
v2: C style fixes. Drop space in between '! test_bit' and extra lines
    after it.

 fs/btrfs/sysfs.c   | 53 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  1 +
 2 files changed, 54 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 7c0324fe97b2..5888e15e3d14 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1422,11 +1422,64 @@ static ssize_t btrfs_devinfo_writeable_show(struct kobject *kobj,
 }
 BTRFS_ATTR(devid, writeable, btrfs_devinfo_writeable_show);
 
+static ssize_t btrfs_devinfo_read_pref_show(struct kobject *kobj,
+					    struct kobj_attribute *a, char *buf)
+{
+	int val;
+	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+						   devid_kobj);
+
+	val = !!test_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state);
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", val);
+}
+
+static ssize_t btrfs_devinfo_read_pref_store(struct kobject *kobj,
+					     struct kobj_attribute *a,
+					     const char *buf, size_t len)
+{
+	int ret;
+	unsigned long val;
+	struct btrfs_device *device;
+
+	ret = kstrtoul(skip_spaces(buf), 0, &val);
+	if (ret)
+		return ret;
+
+	if (val != 0 && val != 1)
+		return -EINVAL;
+
+	/*
+	 * lock is not required, the btrfs_device struct can't be freed while
+	 * its kobject btrfs_device::devid_kobj is still open.
+	 */
+	device = container_of(kobj, struct btrfs_device, devid_kobj);
+
+	if (val &&
+	    !test_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state)) {
+		set_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state);
+		btrfs_info(device->fs_devices->fs_info,
+			   "set read preferred on devid %llu (%d)",
+			   device->devid, task_pid_nr(current));
+	} else if (!val &&
+		   test_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state)) {
+		clear_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state);
+		btrfs_info(device->fs_devices->fs_info,
+			   "reset read preferred on devid %llu (%d)",
+			   device->devid, task_pid_nr(current));
+	}
+
+	return len;
+}
+BTRFS_ATTR_RW(devid, read_preferred, btrfs_devinfo_read_pref_show,
+	      btrfs_devinfo_read_pref_store);
+
 static struct attribute *devid_attrs[] = {
 	BTRFS_ATTR_PTR(devid, in_fs_metadata),
 	BTRFS_ATTR_PTR(devid, missing),
 	BTRFS_ATTR_PTR(devid, replace_target),
 	BTRFS_ATTR_PTR(devid, writeable),
+	BTRFS_ATTR_PTR(devid, read_preferred),
 	NULL
 };
 ATTRIBUTE_GROUPS(devid);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 71ba1f0e93f4..ea786864b903 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -51,6 +51,7 @@ struct btrfs_io_geometry {
 #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
 #define BTRFS_DEV_STATE_NO_READA	(5)
+#define BTRFS_DEV_STATE_READ_PREFERRED	(6)
 
 struct btrfs_zoned_device_info;
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 2/3] btrfs: introduce new device-state read_preferred
  2021-01-20  7:52 ` [PATCH v4 2/3] btrfs: introduce new device-state read_preferred Anand Jain
@ 2021-01-21 10:19   ` Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2021-01-21 10:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, josef



On 20/1/21 3:52 pm, Anand Jain wrote:
> This is a preparatory patch and introduces a new device flag
> 'read_preferred', RW-able using sysfs interface.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com> > ---
> v4: -

There is rb from Josef for this patch in v3.
Could you please add it?

Thanks, Anand


> v2: C style fixes. Drop space in between '! test_bit' and extra lines
>      after it.
> 
>   fs/btrfs/sysfs.c   | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/volumes.h |  1 +
>   2 files changed, 54 insertions(+)
> 
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 7c0324fe97b2..5888e15e3d14 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -1422,11 +1422,64 @@ static ssize_t btrfs_devinfo_writeable_show(struct kobject *kobj,
>   }
>   BTRFS_ATTR(devid, writeable, btrfs_devinfo_writeable_show);
>   
> +static ssize_t btrfs_devinfo_read_pref_show(struct kobject *kobj,
> +					    struct kobj_attribute *a, char *buf)
> +{
> +	int val;
> +	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
> +						   devid_kobj);
> +
> +	val = !!test_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state);
> +
> +	return snprintf(buf, PAGE_SIZE, "%d\n", val);
> +}
> +
> +static ssize_t btrfs_devinfo_read_pref_store(struct kobject *kobj,
> +					     struct kobj_attribute *a,
> +					     const char *buf, size_t len)
> +{
> +	int ret;
> +	unsigned long val;
> +	struct btrfs_device *device;
> +
> +	ret = kstrtoul(skip_spaces(buf), 0, &val);
> +	if (ret)
> +		return ret;
> +
> +	if (val != 0 && val != 1)
> +		return -EINVAL;
> +
> +	/*
> +	 * lock is not required, the btrfs_device struct can't be freed while
> +	 * its kobject btrfs_device::devid_kobj is still open.
> +	 */
> +	device = container_of(kobj, struct btrfs_device, devid_kobj);
> +
> +	if (val &&
> +	    !test_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state)) {
> +		set_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state);
> +		btrfs_info(device->fs_devices->fs_info,
> +			   "set read preferred on devid %llu (%d)",
> +			   device->devid, task_pid_nr(current));
> +	} else if (!val &&
> +		   test_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state)) {
> +		clear_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state);
> +		btrfs_info(device->fs_devices->fs_info,
> +			   "reset read preferred on devid %llu (%d)",
> +			   device->devid, task_pid_nr(current));
> +	}
> +
> +	return len;
> +}
> +BTRFS_ATTR_RW(devid, read_preferred, btrfs_devinfo_read_pref_show,
> +	      btrfs_devinfo_read_pref_store);
> +
>   static struct attribute *devid_attrs[] = {
>   	BTRFS_ATTR_PTR(devid, in_fs_metadata),
>   	BTRFS_ATTR_PTR(devid, missing),
>   	BTRFS_ATTR_PTR(devid, replace_target),
>   	BTRFS_ATTR_PTR(devid, writeable),
> +	BTRFS_ATTR_PTR(devid, read_preferred),
>   	NULL
>   };
>   ATTRIBUTE_GROUPS(devid);
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 71ba1f0e93f4..ea786864b903 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -51,6 +51,7 @@ struct btrfs_io_geometry {
>   #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
>   #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
>   #define BTRFS_DEV_STATE_NO_READA	(5)
> +#define BTRFS_DEV_STATE_READ_PREFERRED	(6)
>   
>   struct btrfs_zoned_device_info;
>   
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v4 3/3] btrfs: introduce new read_policy device
  2021-01-20  7:52 [PATCH v4 0/3] btrfs: read_policy types latency, device and round-robin Anand Jain
  2021-01-20  7:52 ` [PATCH v4 1/3] btrfs: add read_policy latency Anand Jain
  2021-01-20  7:52 ` [PATCH v4 2/3] btrfs: introduce new device-state read_preferred Anand Jain
@ 2021-01-20  7:52 ` Anand Jain
  2021-01-20 12:34 ` [PATCH v4 0/3, full-cover-letter] btrfs: read_policy types latency, device and round-robin Anand Jain
  3 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2021-01-20  7:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, josef

Read-policy type 'device' and device flag 'read-preferred':

The read-policy type device picks the device(s) flagged as
read-preferred for reading stripes of type raid1, raid10,
raid1c3 and raid1c4.

A system might contain SSD, nvme, iscsi, or san lun, and which are all
a non-rotational device, so it is not a good idea to set the read-preferred
automatically. Instead, device read-policy along with the read-preferred
flag provides an ability to do it manually. This advanced tuning is useful
in more than one situation, for example,
 - In heterogeneous-disk volume, it provides an ability to manually choose
    the low latency disks for reading.
 - Useful for more accurate testing.
 - Avoid known problematic device from reading the chunk until it is
   replaced (by marking the other good devices as read-preferred).

Note:

If the read-policy type is set to 'device', but there isn't any device
which is flagged as read-preferred, then stripe 0 is used for reading.

The device replacement won't migrate the read-preferred flag to the new
replace the target device.

As of now, this is an in-memory only feature.

It's pointless to set the read-preferred flag on the missing device, as
IOs aren't submitted to the missing device.

If there is more than one read-preferred device in a chunk, the read IO
shall go to the stripe 0 as of now.

Usage example:

Consider a typical two disks raid1.

Configure devid1 for reading.

$ echo 1 > devinfo/1/read_preferred
$ cat devinfo/1/read_preferred
1
$ cat devinfo/2/read_preferred
0

$ pwd
/sys/fs/btrfs/12345678-1234-1234-1234-123456789abc

$ cat read_policy
[pid] device
$ echo device > ./read_policy
$ cat read_policy
pid [device]

Now read IOs are sent to devid 1 (sdb).

$ echo 3 > /proc/sys/vm/drop_caches
$ md5sum /btrfs/YkZI

$ iostat -zy 1 | egrep 'sdb|sdc' (from another terminal)
sdb              50.00     40048.00         0.00      40048          0

Change the read-preferred device from devid 1 to devid 2 (sdc).

$ echo 0 > ./devinfo/1/read_preferred

[ 3343.918658] BTRFS info (device sdb): reset read preferred on devid 1 (1334)

$ echo 1 > ./devinfo/2/read_preferred

[ 3343.919876] BTRFS info (device sdb): set read preferred on devid 2 (1334)

$ echo 3 > /proc/sys/vm/drop_caches
$ md5sum /btrfs/YkZI

Further read ios are sent to devid 2 (sdc).

$ iostat -zy 1 | egrep 'sdb|sdc' (from another terminal)
sdc              49.00     40048.00         0.00      40048          0

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
v4: add Josef rb.
v3: update the change log.
v2: -
rfc->v1: -

 fs/btrfs/sysfs.c   |  3 ++-
 fs/btrfs/volumes.c | 22 ++++++++++++++++++++++
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 5888e15e3d14..dd1835a2a7ab 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -916,7 +916,8 @@ static bool strmatch(const char *buffer, const char *string)
 }
 
 /* Must follow the order as in enum btrfs_read_policy */
-static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
+static const char * const btrfs_read_policy_name[] = { "pid", "latency",
+						       "device" };
 
 static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 				      struct kobj_attribute *a, char *buf)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f361f1c87eb6..d942260f8d2c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5524,6 +5524,25 @@ static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
 	return best_stripe;
 }
 
+static int btrfs_find_read_preferred(struct map_lookup *map, int first, int num_stripe)
+{
+	int stripe_index;
+	int last = first + num_stripe;
+
+	/*
+	 * If there are more than one read preferred devices, then just pick the
+	 * first found read preferred device as of now.
+	 */
+	for (stripe_index = first; stripe_index < last; stripe_index++) {
+		if (test_bit(BTRFS_DEV_STATE_READ_PREFERRED,
+			     &map->stripes[stripe_index].dev->dev_state))
+			return stripe_index;
+	}
+
+	/* If there is no read preferred device then just use the first stripe */
+	return first;
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct map_lookup *map, int first,
 			    int dev_replace_is_ongoing)
@@ -5557,6 +5576,9 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 		preferred_mirror = btrfs_find_best_stripe(fs_info, map, first,
 							  num_stripes);
 		break;
+	case BTRFS_READ_POLICY_DEVICE:
+		preferred_mirror = btrfs_find_read_preferred(map, first, num_stripes);
+		break;
 	}
 
 	if (dev_replace_is_ongoing &&
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ea786864b903..8d5a2cddc0ab 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -225,6 +225,8 @@ enum btrfs_read_policy {
 	BTRFS_READ_POLICY_PID,
 	/* Find and use device with the lowest latency */
 	BTRFS_READ_POLICY_LATENCY,
+	/* Use the device marked with READ_PREFERRED state */
+	BTRFS_READ_POLICY_DEVICE,
 	BTRFS_NR_READ_POLICY,
 };
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v4 0/3, full-cover-letter] btrfs: read_policy types latency, device and round-robin
  2021-01-20  7:52 [PATCH v4 0/3] btrfs: read_policy types latency, device and round-robin Anand Jain
                   ` (2 preceding siblings ...)
  2021-01-20  7:52 ` [PATCH v4 3/3] btrfs: introduce new read_policy device Anand Jain
@ 2021-01-20 12:34 ` Anand Jain
  2021-01-22  5:52   ` Anand Jain
  3 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-01-20 12:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: josef, dsterba

[Only some parts of the cover-letter went through, tying again.].

v4:
Add rb from Josef in patch 1 and 3.
In patch 1/3, use fs_info instead of device->fs_devices->fs_info.
Drop round-robin policy because my workload (fio random) shows no performance
 gains due to fewer merges at the block layer.

v3:
The block layer commit 0d02129e76ed (block: merge struct block_device and
struct hd_struct) has changed the first argument in the function
part_stat_read_all() in 5.11-rc1. So trickle down its changes in the patch 1/4.

v2:
Fixes as per review comments, as in the individual patches.

rfc->v1:
Drop the tracing patch.
Drop the factor associated with the inflight commands (because there
were too many unnecessary switches).
Few C styles fix.

-----

This patchset adds read policy types latency, device, and round-robin, for the
mirrored raid profiles such as raid1, raid1c3, raid1c4, and raid10. The default
read policy remains as PID, as of now.

Read policy types:
Latency:

Latency policy routes the read IO based on the historical average
wait time experienced by the read IOs on the individual device.

Device:

With the device policy along with the read_preferred flag, you can
set the device for reading manually. Useful to test mirrors in a
deterministic way and helps advance system administrations.

Round-robin (RFC patch):

Alternates striped device in a round-robin loop for reading. To achieve
this first we put the stripes in an array, sort it by devid and pick the
next device.

Test scripts:
=============

I have included a few scripts which were useful for testing.

-------------------8<--------------------------------
Set latency policy on the btrfs mounted at /mnt

Usage example:
  $ readpolicyset /mnt latency

Anand Jain (3):
  btrfs: add read_policy latency
  btrfs: introduce new device-state read_preferred
  btrfs: introduce new read_policy device

 fs/btrfs/sysfs.c   | 57 ++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  5 ++++
 3 files changed, 121 insertions(+), 1 deletion(-)

-- 
2.28.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/3, full-cover-letter] btrfs: read_policy types latency, device and round-robin
  2021-01-20 12:34 ` [PATCH v4 0/3, full-cover-letter] btrfs: read_policy types latency, device and round-robin Anand Jain
@ 2021-01-22  5:52   ` Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2021-01-22  5:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: josef, dsterba

[Oops. A part of the cover letter is missing again. The cover-letter
  file has it all. I am not sure why it happened.
  Here below, I am just sending it by email].

v4:
Add rb from Josef in patch 1 and 3.
In patch 1/3, use fs_info instead of device->fs_devices->fs_info.
Drop round-robin policy because my workload (fio random) shows no
performance gains due to fewer merges at the block layer.

v3:
The block layer commit 0d02129e76ed (block: merge struct block_device
and struct hd_struct) has changed the first argument in the function
part_stat_read_all() in 5.11-rc1. So trickle down its changes in the 
patch 1/4.

v2:
Fixes as per review comments, as in the individual patches.

rfc->v1:
Drop the tracing patch.
Drop the factor associated with the inflight commands (because there
were too many unnecessary switches).
Few C styles fix.

-----

This patchset adds read policy types latency, device, and round-robin,
for the mirrored raid profiles such as raid1, raid1c3, raid1c4, and 
raid10. The default read policy remains as PID, as of now.

Read policy types:
Latency:

Latency policy routes the read IO based on the historical average
wait time experienced by the read IOs on the individual device.

Device:

With the device policy along with the read_preferred flag, you can
set the device for reading manually. Useful to test mirrors in a
deterministic way and helps advance system administrations.

Round-robin (RFC patch, removed in v4):

Alternates striped device in a round-robin loop for reading. To achieve
this first we put the stripes in an array, sort it by devid and pick the
next device.

Test scripts:
=============

I have included a few scripts which were useful for testing.

-------------------8<--------------------------------
Set latency policy on the btrfs mounted at /mnt

Usage example:
   $ readpolicyset /mnt latency

$ cat readpolicyset
#!/bin/bash

: ${1?"arg1 <mnt> missing"}
: ${2?"arg2 <pid|latency|device|roundrobin> missing"}

mnt=$1
policy=$2
[ $policy == "device" ] && { : ${3?"arg3 <devid> missing"}; }
devid=$3

uuid=$(btrfs fi show -m /btrfs | grep uuid | awk '{print $4}')
p=/sys/fs/btrfs/$uuid/read_policy
q=/sys/fs/btrfs/$uuid/devinfo

[ $policy == "device" ] && { echo 1 > ${q}/$devid/read_preferred || exit 
$?; }

echo $policy > $p
exit $?

-------------------8<--------------------------------

Read policy type from the btrfs mounted at /mnt

Usage example:
   $ readpolicy /mnt

$ cat readpolicy
#!/bin/bash

: ${1?"arg1 <mnt> missing"}
mnt=$1

uuid=$(btrfs fi show -m /btrfs | grep uuid | awk '{print $4}')
p=/sys/fs/btrfs/$uuid/read_policy
q=/sys/fs/btrfs/$uuid/devinfo

policy=$(cat $p)
echo -n "$policy ( "

for i in $(find $q -type f -name read_preferred | xargs cat)
do
	echo -n "$i"
done
echo ")"
-------------------8<--------------------------------

Show the number of read IO per devices for the give command.

Usage example:
    $ readstat /mnt fioread

$ cat readstat
#!/bin/bash

: ${1?"arg1 <mnt> is missing"}
: ${2?"arg2 <cmd-to-run> is missing"}

mnt=$1; shift
mountpoint -q $mnt || { echo "ERROR: $mnt is not mounted"; exit 1; }

declare -A devread

for dev in $(btrfs filesystem show -m $mnt | grep devid |awk '{print $8}')
do
	prefix=$(echo $dev | rev | cut -d"/" -f1 | rev)
	sysfs_path=$(find /sys | grep $prefix/stat$)

	devread[$sysfs_path]=$(cat $sysfs_path | awk '{print $1}')
done

"$@" | grep "READ: bw"

echo
echo
for sysfs_path in ${!devread[@]}
do
	dev=$(echo $sysfs_path | rev | cut -d"/" -f2 | rev)
	new=$(cat $sysfs_path | awk '{print $1}')
	old=${devread[$sysfs_path]}
	echo "$dev $((new - old))"
done
-------------------8<--------------------------------

Run fio read command

Usage example:
     $ touch /mnt/largefile
     $ fioread /mnt/largefile 500m

$ cat fioread
#!/bin/bash

: ${1?"arg1 </mnt/file> is missing"}
: ${2?"arg2 <1Gi|50Gi> is missing"}

tf=$1
sz=$2
mnt=$(stat -c '%m' $tf)

fio \
--filename=$tf \
--directory=$mnt \
--filesize=$sz \
--size=$sz \
--rw=randread \
--bs=64k \
--ioengine=libaio \
--direct=1 \
--numjobs=32 \
--group_reporting \
--thread \
--name iops-test-job
-------------------8<--------------------------------


Testing on guest VM
~~~~~~~~~~~~~~~~~~~

The test results from my VM with 2 devices of type sata and 2 devices of
type virtio, are below. Performance results are for raid1c4, raid10, and
raid1 are as below.

The workload is fio read 32 threads, 500m random reads.

Fio is passed to the script called readstat, which returns the number of
read IOs per device sent during the fio.

Supporting fio logs are below. And readstat shows the number of read IOs
to the devices (excluding the merges).

raid1c4
=======

pid
----

$ readpolicyset /btrfs pid && readpolicy /btrfs && dropcache && readstat 
/btrfs fioread /btrfs/largefile 500m

[pid] latency device roundrobin ( 0000)
  READ: bw=87.0MiB/s (91.2MB/s), 87.0MiB/s-87.0MiB/s 
(91.2MB/s-91.2MB/s), io=15.6GiB (16.8GB), run=183884-183884msec

vdb 64060
vdc 64053
sdb 64072
sda 64054


latency
-------

(All devices are non-rotational, but sda and sdb are of type sata and 
vdb and vdc are of type virtio).

$ readpolicyset /btrfs latency && readpolicy /btrfs && dropcache && 
readstat /btrfs fioread /btrfs/largefile 500m

pid [latency] device roundrobin ( 0000)
  READ: bw=87.1MiB/s (91.3MB/s), 87.1MiB/s-87.1MiB/s 
(91.3MB/s-91.3MB/s), io=15.6GiB (16.8GB), run=183774-183774msec

vdb 255844
vdc 559
sdb 0
sda 93


roundrobin
----------

$ readpolicyset /btrfs roundrobin && readpolicy /btrfs && dropcache && 
readstat /btrfs fioread /btrfs/largefile 500m

pid latency device [roundrobin] ( 0000)
  READ: bw=51.0MiB/s (54.5MB/s), 51.0MiB/s-51.0MiB/s 
(54.5MB/s-54.5MB/s), io=15.6GiB (16.8GB), run=307755-307755msec

vdb 866859
vdc 866651
sdb 864139
sda 865533


raid10
======

pid
---

$ readpolicyset /btrfs pid && readpolicy /btrfs && dropcache && readstat 
/btrfs fioread /btrfs/largefile 500m

[pid] latency device roundrobin ( 0000)
  READ: bw=85.2MiB/s (89.3MB/s), 85.2MiB/s-85.2MiB/s 
(89.3MB/s-89.3MB/s), io=15.6GiB (16.8GB), run=187864-187864msec


sdf 64053
sde 64036
sdd 64043
sdc 64038


latency
-------

$ readpolicyset /btrfs latency && readpolicy /btrfs && dropcache && 
readstat /btrfs fioread /btrfs/largefile 500m

pid [latency] device roundrobin ( 0000)
  READ: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s 
(89.5MB/s-89.5MB/s), io=15.6GiB (16.8GB), run=187370-187370msec


sdf 117494
sde 10748
sdd 125247
sdc 2921

roundrobin
----------

$ readpolicyset /btrfs roundrobin && readpolicy /btrfs && dropcache && 
readstat /btrfs fioread /btrfs/largefile 500m

pid latency device [roundrobin] ( 0000)
  READ: bw=55.4MiB/s (58.1MB/s), 55.4MiB/s-55.4MiB/s 
(58.1MB/s-58.1MB/s), io=15.6GiB (16.8GB), run=288701-288701msec

sdf 617593
sde 617381
sdd 618486
sdc 618633


raid1
=====

pid
----

$ readpolicyset /btrfs pid && readpolicy /btrfs && dropcache && readstat 
/btrfs fioread /btrfs/largefile 500m

[pid] latency device roundrobin ( 00)
  READ: bw=78.8MiB/s (82.6MB/s), 78.8MiB/s-78.8MiB/s 
(82.6MB/s-82.6MB/s), io=15.6GiB (16.8GB), run=203158-203158msec

sdb 128087
sda 128090


latency
--------

$ readpolicyset /btrfs latency && readpolicy /btrfs && dropcache && 
readstat /btrfs fioread /btrfs/largefile 500m

pid [latency] device roundrobin ( 00)
  READ: bw=86.5MiB/s (90.7MB/s), 86.5MiB/s-86.5MiB/s 
(90.7MB/s-90.7MB/s), io=15.6GiB (16.8GB), run=185023-185023msec

sdb 567
sda 255942


device
-------

(From the latency test results (above) we know sda is providing low 
latency read
IO. So set sda as read preferred device.)

$ readpolicyset /btrfs device 1 && readpolicy /btrfs && dropcache && 
readstat /btrfs fioread /btrfs/largefile 500m

pid latency [device] roundrobin ( 10)
  READ: bw=88.2MiB/s (92.5MB/s), 88.2MiB/s-88.2MiB/s 
(92.5MB/s-92.5MB/s), io=15.6GiB (16.8GB), run=181374-181374msec

sdb 0
sda 256191


roundrobin
-----------

$ readpolicyset /btrfs roundrobin && readpolicy /btrfs && dropcache && 
readstat /btrfs fioread /btrfs/largefile 500m

pid latency device [roundrobin] ( 00)
  READ: bw=54.1MiB/s (56.7MB/s), 54.1MiB/s-54.1MiB/s 
(56.7MB/s-56.7MB/s), io=15.6GiB (16.8GB), run=295693-295693msec

sdb 1252584
sda 1254258


Testing on real hardware:
~~~~~~~~~~~~~~~~~~~~~~~~

raid1 Read 500m
-----------------------------------------------------
             |nvme+ssd  nvme+ssd   all-nvme  all-nvme
             |random    sequential random    sequential
------------+------------------------------------------
pid         | 744MiB/s  809MiB/s  2225MiB/s 2155MiB/s
latency     |2072MiB/s 2008MiB/s  1999MiB/s 1961MiB/s
device(nvme)|2187MiB/s 2063MiB/s  2125MiB/s 2080MiB/s
roundrobin  | 527MiB/s  519MiB/s  2137MiB/s 1876MiB/s


raid10 Read 500m
-----------------------------------------------------
             | nvme+ssd  nvme+ssd  all-nvme  all-nvme
             | random    seq       random    seq
------------+-----------------------------------------
pid         | 1282MiB/s 1427MiB/s 2152MiB/s 1969MiB/s
latency     | 2073MiB/s 1871MiB/s 1975MiB/s 1984MiB/s
device(nvme)| 2447MiB/s 1873MiB/s 2184MiB/s 2015MiB/s
roundrobin  | 1117MiB/s 1076MiB/s 2020MiB/s 2030MiB/s


raid1c3 Read 500m
-----------------------------------------------------
             | nvme+ssd  nvme+ssd  all-nvme  all-nvme
             | random    seq       random    seq
------------+-----------------------------------------
pid         |  973MiB/s  955MiB/s 2144MiB/s 1962MiB/s
latency     | 2005MiB/s 1924MiB/s 2083MiB/s 1980MiB/s
device(nvme)| 2021MiB/s 2034MiB/s 1920MiB/s 2132MiB/s
roundrobin  |  707MiB/s  701MiB/s 1760MiB/s 1990MiB/s


raid1c4 Read 500m
-----------------------------------------------------
             | nvme+ssd  nvme+ssd  all-nvme  all-nvme
             | random    seq       random    seq
------------+----------------------------------------
pid         | 1204MiB/s 1221MiB/s 2065MiB/s 1878MiB/s
latency     | 1990MiB/s 1920MiB/s 1945MiB/s 1865MiB/s
device(nvme)| 2109MiB/s 1935MiB/s 2153MiB/s 1991MiB/s
roundrobin  |  887MiB/s  865MiB/s 1948MiB/s 1796MiB/s


Observations:
=============

1.
As our chunk allocation is based on the device's available size
at that time. So stripe 0 may be circulating among the devices.
So a single-threaded process running with a constant PID, may balance
the read IO among devices. But it is not guaranteed to work in all the
cases, and it might not work very well in the case of raid1c3/4. Further,
PID provides terrible performance if the devices are heterogeneous in
terms of either type, speed, or size.

2.
Latency provides performance equal to PID if all devices are of same
type. Latency needs iostat be enabled and includes cost of calculating
the avg. wait time. So if you factor in a similar cost of calculating the
avg. wait time in case of PID policy (using the debug code [2]) then the
Latency performance is better than PID. This proves that read IO
distribution as per latency is working, but there is a cost to it. And
moreover, latency works for any type of devices.

3.
Round robin is worst (unless there is a bug in my patch). The total
number of new IOs issued is almost double when compared with the PID and
Latency read_policy, that's because there were fewer number of IO merges
in the block layer due to constant switching of devices in the btrfs.

4.
4.
Device read_policy is useful in testing and provides advanced sysadmin
capabilities. When known how to use, the policy could help avert
performance degradation due to csum/IO errors at production.

Thanks, Anand

------------------
[2] Debug patch to factor the cost of calculating the latency per IO.

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d3023879bdf6..72ec633e9063 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5665,6 +5665,12 @@ static int find_live_mirror(struct btrfs_fs_info 
*fs_info,
  fs_info->fs_devices->read_policy = BTRFS_READ_POLICY_PID;
  fallthrough;
  case BTRFS_READ_POLICY_PID:
+ /*
+ * Just to factor in the cost of calculating the avg wait using
+ * iostat, call btrfs_find_best_stripe() here for the PID policy
+ * and drop its results on the floor.
+ */
+ btrfs_find_best_stripe(fs_info, map, first, num_stripes, log,
+ logsz);
  preferred_mirror = first + current->pid % num_stripes;
  scnprintf(log, logsz,
  "first %d num_stripe %d %s (%d) preferred %d",
-------------------------



On 20/1/21 8:34 pm, Anand Jain wrote:
> [Only some parts of the cover-letter went through, tying again.].
> 
> v4:
> Add rb from Josef in patch 1 and 3.
> In patch 1/3, use fs_info instead of device->fs_devices->fs_info.
> Drop round-robin policy because my workload (fio random) shows no performance
>   gains due to fewer merges at the block layer.
> 
> v3:
> The block layer commit 0d02129e76ed (block: merge struct block_device and
> struct hd_struct) has changed the first argument in the function
> part_stat_read_all() in 5.11-rc1. So trickle down its changes in the patch 1/4.
> 
> v2:
> Fixes as per review comments, as in the individual patches.
> 
> rfc->v1:
> Drop the tracing patch.
> Drop the factor associated with the inflight commands (because there
> were too many unnecessary switches).
> Few C styles fix.
> 
> -----
> 
> This patchset adds read policy types latency, device, and round-robin, for the
> mirrored raid profiles such as raid1, raid1c3, raid1c4, and raid10. The default
> read policy remains as PID, as of now.
> 
> Read policy types:
> Latency:
> 
> Latency policy routes the read IO based on the historical average
> wait time experienced by the read IOs on the individual device.
> 
> Device:
> 
> With the device policy along with the read_preferred flag, you can
> set the device for reading manually. Useful to test mirrors in a
> deterministic way and helps advance system administrations.
> 
> Round-robin (RFC patch):
> 
> Alternates striped device in a round-robin loop for reading. To achieve
> this first we put the stripes in an array, sort it by devid and pick the
> next device.
> 
> Test scripts:
> =============
> 
> I have included a few scripts which were useful for testing.
> 
> -------------------8<--------------------------------
> Set latency policy on the btrfs mounted at /mnt
> 
> Usage example:
>    $ readpolicyset /mnt latency
> 
> Anand Jain (3):
>    btrfs: add read_policy latency
>    btrfs: introduce new device-state read_preferred
>    btrfs: introduce new read_policy device
> 
>   fs/btrfs/sysfs.c   | 57 ++++++++++++++++++++++++++++++++++++++++++-
>   fs/btrfs/volumes.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/volumes.h |  5 ++++
>   3 files changed, 121 insertions(+), 1 deletion(-)
> 

^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-02-10  6:16 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-20  7:52 [PATCH v4 0/3] btrfs: read_policy types latency, device and round-robin Anand Jain
2021-01-20  7:52 ` [PATCH v4 1/3] btrfs: add read_policy latency Anand Jain
2021-01-20 12:14   ` David Sterba
2021-01-21 10:10     ` Anand Jain
2021-01-21 17:52       ` David Sterba
2021-01-22  8:10         ` Anand Jain
2021-01-30  1:08           ` Anand Jain
2021-02-04 12:30             ` Anand Jain
2021-02-09 21:12               ` Michal Rostecki
2021-02-10  6:14                 ` Anand Jain
2021-01-20  7:52 ` [PATCH v4 2/3] btrfs: introduce new device-state read_preferred Anand Jain
2021-01-21 10:19   ` Anand Jain
2021-01-20  7:52 ` [PATCH v4 3/3] btrfs: introduce new read_policy device Anand Jain
2021-01-20 12:34 ` [PATCH v4 0/3, full-cover-letter] btrfs: read_policy types latency, device and round-robin Anand Jain
2021-01-22  5:52   ` Anand Jain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).