* [PATCH 0/2] Policy to balance read across mirrored devices
@ 2018-01-30  6:30 Anand Jain
  2018-01-30  6:30 ` [PATCH 1/2] btrfs: add mount option read_mirror_policy Anand Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Anand Jain @ 2018-01-30  6:30 UTC (permalink / raw)
  To: linux-btrfs

With RAID1 and RAID10 the devices are mirrored, so a read IO can
be served from any device holding a copy. Which device gets picked
should be configurable; no single policy can satisfy all types of
workloads and configurations.

So before adding more policies, this patch set makes the existing
$pid policy configurable through a mount option and adds a devid
based read_mirror device policy, keeping the $pid based policy as
the default for now. The mount option thus makes it easy to try
out different read-mirror load-balancing schemes.

Further policies can then be added on top of this, for example:

  mount -o read_mirror_policy=pid (current, default) [1]
  mount -o read_mirror_policy=<devid> [2]
  mount -o read_mirror_policy=lba (illustration only) [3]
  mount -o read_mirror_policy=ssd (illustration only) [4]
  mount -o read_mirror_policy=io (illustration only) [5]
  mount -o read_mirror_policy=<heuristic1> (illustration only) [6]

[1]
 Current PID based read mirror balance.

[2]
 Set the devid of the device that should be used for reads; all
 normal reads will then go to that particular device only. This
 also helps testing and gives better control to test scripts,
 including reads issued in the mount context.

[3]
 With SAN storage, some arrays prefer that the host access the
 LUN based on the LBA, so that caching is not duplicated on the
 storage side.

[4]
 With a mix of SSDs and HDDs we may want to use the SSD as the
 primary read device.

[5]
 If storage caching is not the bottleneck but the transport layer
 is, then the read load should be tuned based on the IO load.

[6]
 Or a combination of any of the above, in priority order.
 Timofey could consider basing his patch on top of this:
    Btrfs: enchanse raid1/10 balance heuristic

This patch set is on top of the preparatory patch set:
  [PATCH 0/2] Preparatory to add read_mirror mount option


Anand Jain (2):
  btrfs: add mount option read_mirror_policy
  btrfs: add read_mirror_policy parameter devid

 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/super.c   | 31 +++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c | 18 +++++++++++++++++-
 fs/btrfs/volumes.h |  7 +++++++
 4 files changed, 57 insertions(+), 1 deletion(-)

-- 
2.7.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/2] btrfs: add mount option read_mirror_policy
  2018-01-30  6:30 [PATCH 0/2] Policy to balance read across mirrored devices Anand Jain
@ 2018-01-30  6:30 ` Anand Jain
  2018-01-31  8:06   ` Nikolay Borisov
  2018-01-30  6:30 ` [PATCH 2/2] btrfs: add read_mirror_policy parameter devid Anand Jain
  2018-01-31  7:51 ` [PATCH 0/2] Policy to balance read across mirrored devices Peter Becker
  2 siblings, 1 reply; 23+ messages in thread
From: Anand Jain @ 2018-01-30  6:30 UTC (permalink / raw)
  To: linux-btrfs

With RAID1 and RAID10 the devices are mirrored, so a read IO can
be served from any device holding a copy. Which device gets picked
should be configurable; no single policy can satisfy all types of
workloads and configurations.

So before adding more policies, this patch makes the existing
$pid policy configurable through a mount option.

For example..
  mount -o read_mirror_policy=pid (which is also the default)

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/super.c   | 10 ++++++++++
 fs/btrfs/volumes.c |  8 +++++++-
 fs/btrfs/volumes.h |  5 +++++
 4 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1a462ab85c49..4759e988b0df 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1100,6 +1100,8 @@ struct btrfs_fs_info {
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
 #endif
+	/* Policy to balance read across mirrored devices */
+	int read_mirror_policy;
 };
 
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 367ecbf477b9..dfe6b3c67df3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -329,6 +329,7 @@ enum {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	Opt_ref_verify,
 #endif
+	Opt_read_mirror_policy,
 	Opt_err,
 };
 
@@ -393,6 +394,7 @@ static const match_table_t tokens = {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	{Opt_ref_verify, "ref_verify"},
 #endif
+	{Opt_read_mirror_policy, "read_mirror_policy=%s"},
 	{Opt_err, NULL},
 };
 
@@ -839,6 +841,14 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			btrfs_set_opt(info->mount_opt, REF_VERIFY);
 			break;
 #endif
+		case Opt_read_mirror_policy:
+			if (strcmp(args[0].from, "pid") == 0) {
+				info->read_mirror_policy =
+					BTRFS_READ_MIRROR_BY_PID;
+				break;
+			}
+			ret = -EINVAL;
+			goto out;
 		case Opt_err:
 			btrfs_info(info, "unrecognized mount option '%s'", p);
 			ret = -EINVAL;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a61715677b67..39ba59832f38 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5269,7 +5269,13 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	else
 		num = map->num_stripes;
 
-	optimal = first + current->pid % num;
+	switch(fs_info->read_mirror_policy) {
+	case BTRFS_READ_MIRROR_DEFAULT:
+	case BTRFS_READ_MIRROR_BY_PID:
+	default:
+		optimal = first + current->pid % num;
+		break;
+	}
 
 	if (dev_replace_is_ongoing &&
 	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 28c28eeadff3..78f35d299a61 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -47,6 +47,11 @@ struct btrfs_pending_bios {
 #define btrfs_device_data_ordered_init(device) do { } while (0)
 #endif
 
+enum btrfs_read_mirror_type {
+	BTRFS_READ_MIRROR_DEFAULT,
+	BTRFS_READ_MIRROR_BY_PID,
+};
+
 #define BTRFS_DEV_STATE_WRITEABLE	(0)
 #define BTRFS_DEV_STATE_IN_FS_METADATA	(1)
 #define BTRFS_DEV_STATE_MISSING		(2)
-- 
2.7.0



* [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
  2018-01-30  6:30 [PATCH 0/2] Policy to balance read across mirrored devices Anand Jain
  2018-01-30  6:30 ` [PATCH 1/2] btrfs: add mount option read_mirror_policy Anand Jain
@ 2018-01-30  6:30 ` Anand Jain
  2018-01-31  8:38   ` Nikolay Borisov
  2018-01-31  7:51 ` [PATCH 0/2] Policy to balance read across mirrored devices Peter Becker
  2 siblings, 1 reply; 23+ messages in thread
From: Anand Jain @ 2018-01-30  6:30 UTC (permalink / raw)
  To: linux-btrfs

Adds the mount option:
  mount -o read_mirror_policy=<devid>

It sets the devid of the device that should be used for reads;
all normal reads will then go to that particular device only.

This also helps testing and gives better control to test scripts,
including reads issued in the mount context.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/super.c   | 21 +++++++++++++++++++++
 fs/btrfs/volumes.c | 10 ++++++++++
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 33 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index dfe6b3c67df3..d3aad8cccc7e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -847,6 +847,27 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 					BTRFS_READ_MIRROR_BY_PID;
 				break;
 			}
+
+			intarg = 0;
+			if (match_int(&args[0], &intarg) == 0) {
+				struct btrfs_device *device;
+
+				device = btrfs_find_device(info, intarg,
+							   NULL, NULL);
+				if (!device) {
+					btrfs_err(info,
+					  "read_mirror_policy: invalid devid %d",
+					  intarg);
+					ret = -EINVAL;
+					goto out;
+				}
+				info->read_mirror_policy =
+						BTRFS_READ_MIRROR_BY_DEV;
+				set_bit(BTRFS_DEV_STATE_READ_MIRROR,
+					&device->dev_state);
+				break;
+			}
+
 			ret = -EINVAL;
 			goto out;
 		case Opt_err:
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 39ba59832f38..478623e6e074 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5270,6 +5270,16 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 		num = map->num_stripes;
 
 	switch(fs_info->read_mirror_policy) {
+	case BTRFS_READ_MIRROR_BY_DEV:
+		optimal = first;
+		if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
+			     &map->stripes[optimal].dev->dev_state))
+			break;
+		if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
+			     &map->stripes[++optimal].dev->dev_state))
+			break;
+		optimal = first;
+		break;
 	case BTRFS_READ_MIRROR_DEFAULT:
 	case BTRFS_READ_MIRROR_BY_PID:
 	default:
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 78f35d299a61..7281f55dea05 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -50,6 +50,7 @@ struct btrfs_pending_bios {
 enum btrfs_read_mirror_type {
 	BTRFS_READ_MIRROR_DEFAULT,
 	BTRFS_READ_MIRROR_BY_PID,
+	BTRFS_READ_MIRROR_BY_DEV,
 };
 
 #define BTRFS_DEV_STATE_WRITEABLE	(0)
@@ -57,6 +58,7 @@ enum btrfs_read_mirror_type {
 #define BTRFS_DEV_STATE_MISSING		(2)
 #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
+#define BTRFS_DEV_STATE_READ_MIRROR	(5)
 
 struct btrfs_device {
 	struct list_head dev_list;
-- 
2.7.0



* Re: [PATCH 0/2] Policy to balance read across mirrored devices
  2018-01-30  6:30 [PATCH 0/2] Policy to balance read across mirrored devices Anand Jain
  2018-01-30  6:30 ` [PATCH 1/2] btrfs: add mount option read_mirror_policy Anand Jain
  2018-01-30  6:30 ` [PATCH 2/2] btrfs: add read_mirror_policy parameter devid Anand Jain
@ 2018-01-31  7:51 ` Peter Becker
  2018-01-31  9:01   ` Anand Jain
  2 siblings, 1 reply; 23+ messages in thread
From: Peter Becker @ 2018-01-31  7:51 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

A little question about mount -o read_mirror_policy=<devid>.

How would this work with RAID1 over 3 or 4 HDDs? In particular,
what happens if the desired block is not available on device
<devid>? Could I repeat this option, like the device= option, to
specify an order/priority like this:

mount -o read_mirror_policy=1,read_mirror_policy=3

2018-01-30 7:30 GMT+01:00 Anand Jain <anand.jain@oracle.com>:
> With RAID1 and RAID10 the devices are mirrored, so a read IO can
> be served from any device holding a copy. Which device gets picked
> should be configurable; no single policy can satisfy all types of
> workloads and configurations.
>
> So before adding more policies, this patch set makes the existing
> $pid policy configurable through a mount option and adds a devid
> based read_mirror device policy, keeping the $pid based policy as
> the default for now. The mount option thus makes it easy to try
> out different read-mirror load-balancing schemes.
>
> Further policies can then be added on top of this, for example:
>
>   mount -o read_mirror_policy=pid (current, default) [1]
>   mount -o read_mirror_policy=<devid> [2]
>   mount -o read_mirror_policy=lba (illustration only) [3]
>   mount -o read_mirror_policy=ssd (illustration only) [4]
>   mount -o read_mirror_policy=io (illustration only) [5]
>   mount -o read_mirror_policy=<heuristic1> (illustration only) [6]
>
> [1]
>  Current PID based read mirror balance.
>
> [2]
>  Set the devid of the device that should be used for reads; all
>  normal reads will then go to that particular device only. This
>  also helps testing and gives better control to test scripts,
>  including reads issued in the mount context.
>
> [3]
>  With SAN storage, some arrays prefer that the host access the
>  LUN based on the LBA, so that caching is not duplicated on the
>  storage side.
>
> [4]
>  With a mix of SSDs and HDDs we may want to use the SSD as the
>  primary read device.
>
> [5]
>  If storage caching is not the bottleneck but the transport layer
>  is, then the read load should be tuned based on the IO load.
>
> [6]
>  Or a combination of any of the above, in priority order.
>  Timofey could consider basing his patch on top of this:
>     Btrfs: enchanse raid1/10 balance heuristic
>
> This patch set is on top of the preparatory patch set:
>   [PATCH 0/2] Preparatory to add read_mirror mount option
>
>
> Anand Jain (2):
>   btrfs: add mount option read_mirror_policy
>   btrfs: add read_mirror_policy parameter devid
>
>  fs/btrfs/ctree.h   |  2 ++
>  fs/btrfs/super.c   | 31 +++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.c | 18 +++++++++++++++++-
>  fs/btrfs/volumes.h |  7 +++++++
>  4 files changed, 57 insertions(+), 1 deletion(-)
>
> --
> 2.7.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 1/2] btrfs: add mount option read_mirror_policy
  2018-01-30  6:30 ` [PATCH 1/2] btrfs: add mount option read_mirror_policy Anand Jain
@ 2018-01-31  8:06   ` Nikolay Borisov
  2018-01-31  9:06     ` Anand Jain
  0 siblings, 1 reply; 23+ messages in thread
From: Nikolay Borisov @ 2018-01-31  8:06 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs



On 30.01.2018 08:30, Anand Jain wrote:
> With RAID1 and RAID10 the devices are mirrored, so a read IO can
> be served from any device holding a copy. Which device gets picked
> should be configurable; no single policy can satisfy all types of
> workloads and configurations.
> 
> So before adding more policies, this patch makes the existing
> $pid policy configurable through a mount option.
> 
> For example..
>   mount -o read_mirror_policy=pid (which is also default)
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/ctree.h   |  2 ++
>  fs/btrfs/super.c   | 10 ++++++++++
>  fs/btrfs/volumes.c |  8 +++++++-
>  fs/btrfs/volumes.h |  5 +++++
>  4 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1a462ab85c49..4759e988b0df 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1100,6 +1100,8 @@ struct btrfs_fs_info {
>  	spinlock_t ref_verify_lock;
>  	struct rb_root block_tree;
>  #endif
> +	/* Policy to balance read across mirrored devices */
> +	int read_mirror_policy;

Make that member an enum btrfs_read_mirror_type.

>  };
>  
>  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 367ecbf477b9..dfe6b3c67df3 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -329,6 +329,7 @@ enum {
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	Opt_ref_verify,
>  #endif
> +	Opt_read_mirror_policy,
>  	Opt_err,
>  };
>  
> @@ -393,6 +394,7 @@ static const match_table_t tokens = {
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	{Opt_ref_verify, "ref_verify"},
>  #endif
> +	{Opt_read_mirror_policy, "read_mirror_policy=%s"},
>  	{Opt_err, NULL},
>  };
>  
> @@ -839,6 +841,14 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>  			btrfs_set_opt(info->mount_opt, REF_VERIFY);
>  			break;
>  #endif
> +		case Opt_read_mirror_policy:
> +			if (strcmp(args[0].from, "pid") == 0) {
> +				info->read_mirror_policy =
> +					BTRFS_READ_MIRROR_BY_PID;
> +				break;
> +			}
> +			ret = -EINVAL;
> +			goto out;
>  		case Opt_err:
>  			btrfs_info(info, "unrecognized mount option '%s'", p);
>  			ret = -EINVAL;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a61715677b67..39ba59832f38 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5269,7 +5269,13 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>  	else
>  		num = map->num_stripes;
>  
> -	optimal = first + current->pid % num;
> +	switch(fs_info->read_mirror_policy) {
> +	case BTRFS_READ_MIRROR_DEFAULT:
> +	case BTRFS_READ_MIRROR_BY_PID:
> +	default:
> +		optimal = first + current->pid % num;
> +		break;
> +	}

Why not factor this code out into a separate function with a
descriptive name and some documentation? It seems you have plans
to extend this mechanism further, so let's try to make it
maintainable from the get-go.

>  
>  	if (dev_replace_is_ongoing &&
>  	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 28c28eeadff3..78f35d299a61 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -47,6 +47,11 @@ struct btrfs_pending_bios {
>  #define btrfs_device_data_ordered_init(device) do { } while (0)
>  #endif
>  
> +enum btrfs_read_mirror_type {
> +	BTRFS_READ_MIRROR_DEFAULT,
> +	BTRFS_READ_MIRROR_BY_PID,
> +};
> +
>  #define BTRFS_DEV_STATE_WRITEABLE	(0)
>  #define BTRFS_DEV_STATE_IN_FS_METADATA	(1)
>  #define BTRFS_DEV_STATE_MISSING		(2)
> 


* Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
  2018-01-30  6:30 ` [PATCH 2/2] btrfs: add read_mirror_policy parameter devid Anand Jain
@ 2018-01-31  8:38   ` Nikolay Borisov
  2018-01-31  9:28     ` Anand Jain
  0 siblings, 1 reply; 23+ messages in thread
From: Nikolay Borisov @ 2018-01-31  8:38 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs



On 30.01.2018 08:30, Anand Jain wrote:
> Adds the mount option:
>   mount -o read_mirror_policy=<devid>
> 
> It sets the devid of the device that should be used for reads;
> all normal reads will then go to that particular device only.
> 
> This also helps testing and gives better control to test scripts,
> including reads issued in the mount context.

Some code comments below. OTOH, does such a policy really make
sense? If the selected device fails, will the other mirror be
retried? And if the answer is yes, why do we really care which
device is tried first?

> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/super.c   | 21 +++++++++++++++++++++
>  fs/btrfs/volumes.c | 10 ++++++++++
>  fs/btrfs/volumes.h |  2 ++
>  3 files changed, 33 insertions(+)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index dfe6b3c67df3..d3aad8cccc7e 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -847,6 +847,27 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>  					BTRFS_READ_MIRROR_BY_PID;
>  				break;
>  			}
> +
> +			intarg = 0;
> +			if (match_int(&args[0], &intarg) == 0) {
> +				struct btrfs_device *device;
> +
> +				device = btrfs_find_device(info, intarg,
> +							   NULL, NULL);
> +				if (!device) {
> +					btrfs_err(info,
> +					  "read_mirror_policy: invalid devid %d",
> +					  intarg);
> +					ret = -EINVAL;
> +					goto out;
> +				}
> +				info->read_mirror_policy =
> +						BTRFS_READ_MIRROR_BY_DEV;
> +				set_bit(BTRFS_DEV_STATE_READ_MIRROR,
> +					&device->dev_state);
> +				break;
> +			}
> +
>  			ret = -EINVAL;
>  			goto out;
>  		case Opt_err:
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 39ba59832f38..478623e6e074 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>  		num = map->num_stripes;
>  
>  	switch(fs_info->read_mirror_policy) {
> +	case BTRFS_READ_MIRROR_BY_DEV:
> +		optimal = first;
> +		if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
> +			     &map->stripes[optimal].dev->dev_state))
> +			break;
> +		if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
> +			     &map->stripes[++optimal].dev->dev_state))
> +			break;
> +		optimal = first;

You set optimal twice; the second assignment seems redundant.
Alongside this patch it also makes sense to send a patch for the
btrfs(5) man page describing the mount option and each implemented
read policy.

Another thing I don't see here is how you handle more than two
devices in the RAID1 case. As it stands, you assume there are two
devices: you first test device 0, then device 1, and completely
ignore any other devices.

> +		break;
>  	case BTRFS_READ_MIRROR_DEFAULT:
>  	case BTRFS_READ_MIRROR_BY_PID:
>  	default:
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 78f35d299a61..7281f55dea05 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -50,6 +50,7 @@ struct btrfs_pending_bios {
>  enum btrfs_read_mirror_type {
>  	BTRFS_READ_MIRROR_DEFAULT,
>  	BTRFS_READ_MIRROR_BY_PID,
> +	BTRFS_READ_MIRROR_BY_DEV,
>  };
>  
>  #define BTRFS_DEV_STATE_WRITEABLE	(0)
> @@ -57,6 +58,7 @@ enum btrfs_read_mirror_type {
>  #define BTRFS_DEV_STATE_MISSING		(2)
>  #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
>  #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
> +#define BTRFS_DEV_STATE_READ_MIRROR	(5)
>  
>  struct btrfs_device {
>  	struct list_head dev_list;
> 


* Re: [PATCH 0/2] Policy to balance read across mirrored devices
  2018-01-31  7:51 ` [PATCH 0/2] Policy to balance read across mirrored devices Peter Becker
@ 2018-01-31  9:01   ` Anand Jain
  2018-01-31 10:47     ` Peter Becker
  0 siblings, 1 reply; 23+ messages in thread
From: Anand Jain @ 2018-01-31  9:01 UTC (permalink / raw)
  To: Peter Becker; +Cc: linux-btrfs


On 01/31/2018 03:51 PM, Peter Becker wrote:

> A little question about mount -o read_mirror_policy=<devid>.
> 
> How would this work with RAID1 over 3 or 4 HDDs? In particular,
> what happens if the desired block is not available on device <devid>?

  When a stripe is not present on the read-optimized disk, it will
  just use the lowest-devid disk containing the stripe (instead of
  falling back to the pid-based random disk).

> Could I repeat this option, like the device= option, to specify an
> order/priority like this:
> 
> mount -o read_mirror_policy=1,read_mirror_policy=3

  Yes.

Thanks, Anand


* Re: [PATCH 1/2] btrfs: add mount option read_mirror_policy
  2018-01-31  8:06   ` Nikolay Borisov
@ 2018-01-31  9:06     ` Anand Jain
  0 siblings, 0 replies; 23+ messages in thread
From: Anand Jain @ 2018-01-31  9:06 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs



On 01/31/2018 04:06 PM, Nikolay Borisov wrote:

>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 1a462ab85c49..4759e988b0df 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1100,6 +1100,8 @@ struct btrfs_fs_info {
>>   	spinlock_t ref_verify_lock;
>>   	struct rb_root block_tree;
>>   #endif
>> +	/* Policy to balance read across mirrored devices */
>> +	int read_mirror_policy;
> 
> make that member enum btrfs_read_mirror_type

  yep. Will do.

::

>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index a61715677b67..39ba59832f38 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5269,7 +5269,13 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>>   	else
>>   		num = map->num_stripes;
>>   
>> -	optimal = first + current->pid % num;
>> +	switch(fs_info->read_mirror_policy) {
>> +	case BTRFS_READ_MIRROR_DEFAULT:
>> +	case BTRFS_READ_MIRROR_BY_PID:
>> +	default:
>> +		optimal = first + current->pid % num;
>> +		break;
>> +	}
> 
> Why not factor out this code in a separate function with descriptive
> name and some documentation. It seems you have plans how to extend this
> mechanism further so let's try and make it maintainable from the get-go.

   This in fact restores the original design; I will add comments.
   In the long term we may have a couple more choices (like LBA),
   so I will move it to a function.

Thanks, Anand


* Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
  2018-01-31  8:38   ` Nikolay Borisov
@ 2018-01-31  9:28     ` Anand Jain
  2018-01-31  9:54       ` Nikolay Borisov
  0 siblings, 1 reply; 23+ messages in thread
From: Anand Jain @ 2018-01-31  9:28 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs



On 01/31/2018 04:38 PM, Nikolay Borisov wrote:
> 
> 
> On 30.01.2018 08:30, Anand Jain wrote:
>> Adds the mount option:
>>    mount -o read_mirror_policy=<devid>
>>
>> It sets the devid of the device that should be used for reads;
>> all normal reads will then go to that particular device only.
>>
>> This also helps testing and gives better control to test scripts,
>> including reads issued in the mount context.
> 
> Some code comments below. OTOH, does such a policy really make
> sense? If the selected device fails, will the other mirror be
> retried?

  Everything works as usual. read_mirror_policy=<devid> just lets
  the user specify a read-optimized disk, so that we don't depend on
  the pid to pick one of the stripe's mirrored disks; instead we pick
  the one suggested by the user, and if that disk fails we go back to
  the other mirror, which may not be the read-optimized disk, as we
  have no other choice.

> And if the answer is yes, why do we really care which device is
> tried first?

  It matters.
    - If you read from both disks alternately, the LUN cache on the
      storage is duplicated.
    - Some disks are read-optimized; using such a disk for reads, and
      going to the other disk only when it fails, gives better overall
      read performance.

::
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 39ba59832f38..478623e6e074 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>>   		num = map->num_stripes;
>>   
>>   	switch(fs_info->read_mirror_policy) {
>> +	case BTRFS_READ_MIRROR_BY_DEV:
>> +		optimal = first;
>> +		if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>> +			     &map->stripes[optimal].dev->dev_state))
>> +			break;
>> +		if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>> +			     &map->stripes[++optimal].dev->dev_state))
>> +			break;
>> +		optimal = first;
> 
> You set optimal twice; the second assignment seems redundant.

  No, actually: when neither of the disks containing the stripe has
  BTRFS_DEV_STATE_READ_MIRROR set, I just want to use the first
  found stripe.

> Alongside this patch it also makes sense to send a patch for the
> btrfs(5) man page describing the mount option and each implemented
> read policy.

  Yep. Will do.

> Another thing I don't see here is how you handle more than two
> devices in the RAID1 case. As it stands, you assume there are two
> devices: you first test device 0, then device 1, and completely
> ignore any other devices.

  Not really. That part is already handled by the extent mapping:
  as the number of stripes for raid1 is two, the extent mapping
  already supplies the two devices holding this stripe.

Thanks, Anand


* Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
  2018-01-31  9:28     ` Anand Jain
@ 2018-01-31  9:54       ` Nikolay Borisov
  2018-01-31 13:38         ` Anand Jain
  0 siblings, 1 reply; 23+ messages in thread
From: Nikolay Borisov @ 2018-01-31  9:54 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs



On 31.01.2018 11:28, Anand Jain wrote:
> 
> 
> On 01/31/2018 04:38 PM, Nikolay Borisov wrote:
>>
>>
>> On 30.01.2018 08:30, Anand Jain wrote:
>>> Adds the mount option:
>>>    mount -o read_mirror_policy=<devid>
>>>
>>> It sets the devid of the device that should be used for reads;
>>> all normal reads will then go to that particular device only.
>>>
>>> This also helps testing and gives better control to test scripts,
>>> including reads issued in the mount context.
>>
>> Some code comments below. OTOH, does such a policy really make
>> sense? If the selected device fails, will the other mirror be
>> retried?
> 
>  Everything works as usual. read_mirror_policy=<devid> just lets
>  the user specify a read-optimized disk, so that we don't depend on
>  the pid to pick one of the stripe's mirrored disks; instead we pick
>  the one suggested by the user, and if that disk fails we go back to
>  the other mirror, which may not be the read-optimized disk, as we
>  have no other choice.
> 
>> And if the answer is yes, why do we really care which device is
>> tried first?
> 
>  It matters.
>    - If you read from both disks alternately, the LUN cache on the
>      storage is duplicated.
>    - Some disks are read-optimized; using such a disk for reads, and
>      going to the other disk only when it fails, gives better overall
>      read performance.

Usually this functionality would be handled by the RAID/SAN
controller, I guess; but given that btrfs is playing the role of a
controller here, at what point do we draw the line on implementing
block-level functionality in the filesystem?

> 
> ::
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 39ba59832f38..478623e6e074 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
>>> btrfs_fs_info *fs_info,
>>>           num = map->num_stripes;
>>>         switch(fs_info->read_mirror_policy) {
>>> +    case BTRFS_READ_MIRROR_BY_DEV:
>>> +        optimal = first;
>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>> +                 &map->stripes[optimal].dev->dev_state))
>>> +            break;
>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>> +                 &map->stripes[++optimal].dev->dev_state))
>>> +            break;
>>> +        optimal = first;
>>
>> you set optimal 2 times, the second one seems redundant.
> 
>  No, actually: when neither of the disks containing the stripe has
>  BTRFS_DEV_STATE_READ_MIRROR set, I just want to use the first
>  found stripe.

Yes, and the fact that you've already set optimal = first right
after the BTRFS_READ_MIRROR_BY_DEV label ensures exactly that, no?
Why do you need to set optimal again right before the final break?
What am I missing here?

> 
>> Alongside this
>> patch it makes sense to also send a patch to btrfs(5) man page
>> describing the mount option + description of each implemented allocation
>> policy.
> 
>  Yep. Will do.
> 
>> Another thing which I don't see here is how you are handling the case
>> when you have more than 2 devices in the RAID1 case. As it stands
>> currently you assume there are two devices and first test device 0 and
>> then device 1 and completely ignore any other devices.
> 
>  Not really. That part is already handled by the extent mapping:
>  as the number of stripes for raid1 is two, the extent mapping
>  already supplies the two devices holding this stripe.
> 
> Thanks, Anand


* Re: [PATCH 0/2] Policy to balance read across mirrored devices
  2018-01-31  9:01   ` Anand Jain
@ 2018-01-31 10:47     ` Peter Becker
  2018-01-31 14:26       ` Anand Jain
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Becker @ 2018-01-31 10:47 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

2018-01-31 10:01 GMT+01:00 Anand Jain <anand.jain@oracle.com>:
>  When a stripe is not present on the read-optimized disk, it will
>  just use the lowest-devid disk containing the stripe (instead of
>  falling back to the pid-based random disk).

Is this good behavior? Because it would eliminate every
performance benefit of the pid-based random disk pick whenever the
requested stripe is not present on the read-optimized disk.
Wouldn't it be better to specify a fallback, with the pid-based
random pick as the default fallback?

For example:

RAID1 over 4 disks:

devid | rpm | size
------------------------
1 | 7200 rpm | 3 TB
2 | 7200 rpm | 3 TB
3 | 5400 rpm | 4 TB
4 | 5400 rpm | 4 TB

mount -o read_mirror_policy=1,read_mirror_policy=2

Cases:
1. If the requested stripe is on devid 3 and 4, the algorithm should
choose one of the two randomly to increase performance, instead of
reading every time from 3 and never from 4.
2. If the requested stripe is on devid 1 and 3, all is fine (as long
as the queue depth of 1 isn't much larger than the queue depth of 3).
3. If the requested stripe is on devid 1 and 2, the algorithm should
choose one of the two randomly to increase performance, instead of
reading every time from 1 and never from 2.

And in the future, all random picks of a device should be replaced by
a heuristic algorithm that respects the queue depth and sequential
reads.
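The preference-plus-fallback order described above can be sketched in
user space; `pick_stripe`, the struct layout, and the pid-modulo
distribution below are illustrative stand-ins, not actual btrfs code:

```c
#include <stdbool.h>

/* Mock of a stripe: the device it lives on and whether that device
 * was named in read_mirror_policy (the "preferred" flag). */
struct stripe { int devid; bool read_mirror; };

/* Prefer mirrors flagged by read_mirror_policy; when none (case 1)
 * or several (case 3) qualify, distribute with the pid instead of
 * always taking the lowest devid. */
static int pick_stripe(const struct stripe *stripes, int num, int pid)
{
    int i, nmatch = 0, k;

    for (i = 0; i < num; i++)
        if (stripes[i].read_mirror)
            nmatch++;

    if (nmatch == 0)            /* case 1: no preferred mirror here */
        return pid % num;

    k = pid % nmatch;           /* cases 2 and 3: pid-balance among */
    for (i = 0; i < num; i++)   /* just the preferred mirrors       */
        if (stripes[i].read_mirror && k-- == 0)
            return i;
    return 0;                   /* not reached */
}
```

With this shape, case 2 degenerates to "use the single preferred
mirror", while cases 1 and 3 keep the pid-based spread.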

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
  2018-01-31  9:54       ` Nikolay Borisov
@ 2018-01-31 13:38         ` Anand Jain
  2018-01-31 13:42           ` Nikolay Borisov
  0 siblings, 1 reply; 23+ messages in thread
From: Anand Jain @ 2018-01-31 13:38 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs



On 01/31/2018 05:54 PM, Nikolay Borisov wrote:
> 
> 
> On 31.01.2018 11:28, Anand Jain wrote:
>>
>>
>> On 01/31/2018 04:38 PM, Nikolay Borisov wrote:
>>>
>>>
>>> On 30.01.2018 08:30, Anand Jain wrote:
>>>> Adds the mount option:
>>>>     mount -o read_mirror_policy=<devid>
>>>>
>>>> To set the devid of the device which should be used for read. That
>>>> means all the normal reads will go to that particular device only.
>>>>
>>>> This also helps testing and gives a better control for the test
>>>> scripts including mount context reads.
>>>
>>> Some code comments below. OTOH, does such policy really make sense, what
>>> happens if the selected device fails, will the other mirror be retried?
>>
>>   Everything as usual, read_mirror_policy=devid just lets the user to
>>   specify his read optimized disk, so that we don't depend on the pid
>>   to pick a stripe mirrored disk, and instead we would pick as suggested
>>   by the user, and if that disk fails then we go back to the other mirror
>>   which may not be the read optimized disk as we have no other choice.
>>
>>> If the answer to the previous question is positive then why do we really
>>> care which device is going to be tried first?
>>
>>   It matters.
>>     - If you are reading from both disks alternatively, then it
>>       duplicates the LUN cache on the storage.
>>     - Some disks are read-optimized and using that for reading and going
>>       back to the other disk only when this disk fails provides a better
>>       overall read performance.
> 
> So usually this should be functionality handled by the raid/san
> controller I guess, > but given that btrfs is playing the role of a
> controller here at what point are we drawing the line of not
> implementing block-level functionality into the filesystem ?

  Don't worry, this is not invading the block layer. How
  could you even build this functionality in the block layer?
  The block layer won't even know that the disks are mirrored;
  RAID does, or BTRFS in our case.

>> ::
>>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>>> index 39ba59832f38..478623e6e074 100644
>>>> --- a/fs/btrfs/volumes.c
>>>> +++ b/fs/btrfs/volumes.c
>>>> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
>>>> btrfs_fs_info *fs_info,
>>>>            num = map->num_stripes;
>>>>          switch(fs_info->read_mirror_policy) {
>>>> +    case BTRFS_READ_MIRROR_BY_DEV:
>>>> +        optimal = first;
>>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>>> +                 &map->stripes[optimal].dev->dev_state))
>>>> +            break;
>>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>>> +                 &map->stripes[++optimal].dev->dev_state))
>>>> +            break;
>>>> +        optimal = first;
>>>
>>> you set optimal 2 times, the second one seems redundant.
>>
>>   No actually. When both of the disks containing the stripe do not
>>   have BTRFS_DEV_STATE_READ_MIRROR set, then I just want to use the
>>   first found stripe.
> 
> Yes, and the fact that you've already set optimal = first right after
> BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again set
> optimal right before the final break? What am I missing here?

   Ah. I think you are missing ++optimal in the 2nd if.

Thanks, Anand

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
  2018-01-31 13:38         ` Anand Jain
@ 2018-01-31 13:42           ` Nikolay Borisov
  2018-01-31 14:36             ` Anand Jain
  0 siblings, 1 reply; 23+ messages in thread
From: Nikolay Borisov @ 2018-01-31 13:42 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs



On 31.01.2018 15:38, Anand Jain wrote:
> 
> 
> On 01/31/2018 05:54 PM, Nikolay Borisov wrote:
>>
>>
>> On 31.01.2018 11:28, Anand Jain wrote:
>>>
>>>
>>> On 01/31/2018 04:38 PM, Nikolay Borisov wrote:
>>>>
>>>>
>>>> On 30.01.2018 08:30, Anand Jain wrote:
>>>>> Adds the mount option:
>>>>>     mount -o read_mirror_policy=<devid>
>>>>>
>>>>> To set the devid of the device which should be used for read. That
>>>>> means all the normal reads will go to that particular device only.
>>>>>
>>>>> This also helps testing and gives a better control for the test
>>>>> scripts including mount context reads.
>>>>
>>>> Some code comments below. OTOH, does such policy really make sense,
>>>> what
>>>> happens if the selected device fails, will the other mirror be retried?
>>>
>>>   Everything as usual, read_mirror_policy=devid just lets the user to
>>>   specify his read optimized disk, so that we don't depend on the pid
>>>   to pick a stripe mirrored disk, and instead we would pick as suggested
>>>   by the user, and if that disk fails then we go back to the other
>>> mirror
>>>   which may not be the read optimized disk as we have no other choice.
>>>
>>>> If the answer to the previous question is positive then why do we
>>>> really
>>>> care which device is going to be tried first?
>>>
>>>   It matters.
>>>     - If you are reading from both disks alternatively, then it
>>>       duplicates the LUN cache on the storage.
>>>     - Some disks are read-optimized and using that for reading and going
>>>       back to the other disk only when this disk fails provides a better
>>>       overall read performance.
>>
>> So usually this should be functionality handled by the raid/san
>> controller I guess, > but given that btrfs is playing the role of a
>> controller here at what point are we drawing the line of not
>> implementing block-level functionality into the filesystem ?
> 
>  Don't worry this is not invading into the block layer. How
>  can you even build this functionality in the block layer ?
>  Block layer even won't know that disks are mirrored. RAID
>  does or BTRFS in our case.
> 

By block layer I guess I meant the storage driver of a particular raid
card. Because what is currently happening is re-implementing
functionality that will generally sit in the driver. So my question was
more generic and high-level - at what point do we draw the line of
> implementing features that are generally implemented in hardware devices
(be it their drivers or firmware).

>>> ::
>>>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>>>> index 39ba59832f38..478623e6e074 100644
>>>>> --- a/fs/btrfs/volumes.c
>>>>> +++ b/fs/btrfs/volumes.c
>>>>> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
>>>>> btrfs_fs_info *fs_info,
>>>>>            num = map->num_stripes;
>>>>>          switch(fs_info->read_mirror_policy) {
>>>>> +    case BTRFS_READ_MIRROR_BY_DEV:
>>>>> +        optimal = first;
>>>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>>>> +                 &map->stripes[optimal].dev->dev_state))
>>>>> +            break;
>>>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>>>> +                 &map->stripes[++optimal].dev->dev_state))
>>>>> +            break;
>>>>> +        optimal = first;
>>>>
>>>> you set optimal 2 times, the second one seems redundant.
>>>
>>>   No actually. When both the disks containing the stripe does not
>>>   have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to
>>>   use first found stripe.
>>
>> Yes, and the fact that you've already set optimal = first right after
>> BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again set
>> optimal right before the final break? What am I missing here?
> 
>   Ah. I think you are missing ++optimal in the 2nd if.

You are right, but I'd prefer you index the stripes array with 'optimal'
and 'optimal + 1' and leave just a single assignment
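For illustration only, the single-assignment probe suggested here could
look like the following user-space sketch, where the dev_state bit test
is mocked with a plain boolean and the names are not the kernel's:

```c
#include <stdbool.h>

struct mock_dev    { bool read_mirror; };   /* stands in for dev_state */
struct mock_stripe { struct mock_dev *dev; };

/* Probe 'first' and 'first + 1' directly, so the fallback to the
 * first found stripe needs no second assignment. */
static int find_optimal(const struct mock_stripe *stripes, int first)
{
    if (stripes[first].dev->read_mirror)
        return first;
    if (stripes[first + 1].dev->read_mirror)
        return first + 1;
    return first;       /* neither device flagged: use first found */
}
```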

> 
> Thanks, Anand
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/2] Policy to balance read across mirrored devices
  2018-01-31 10:47     ` Peter Becker
@ 2018-01-31 14:26       ` Anand Jain
  2018-01-31 14:52         ` Peter Becker
  0 siblings, 1 reply; 23+ messages in thread
From: Anand Jain @ 2018-01-31 14:26 UTC (permalink / raw)
  To: Peter Becker; +Cc: linux-btrfs



On 01/31/2018 06:47 PM, Peter Becker wrote:
> 2018-01-31 10:01 GMT+01:00 Anand Jain <anand.jain@oracle.com>:
>>   When a stripe is not present on the read optimized disk it will just
>>   use the lower devid disk containing the stripe (instead of failing back
>>   to the pid based random disk).
> 
> Is this a good behavior? Because this would eliminate every performance
> benefit of the pid-based random disk pick whenever the requested stripe
> is not present on the read-optimized disk.
> Wouldn't it be better to specify a fallback and use the pid-based
> random pick as the default fallback?
> 
> For example:
> 
> RAID 1 over 4 disks
> 
> devid | rpm | size
> ------------------------
> 1 | 7200 rpm | 3 TB
> 2 | 7200 rpm | 3 TB
> 3 | 5400 rpm | 4 TB
> 4 | 5400 rpm | 4 TB
> 
> mount -o read_mirror_policy=1,read_mirror_policy=2
> 
> Cases:
> 1. If the requested stripe is on devid 3 and 4, the algorithm should
> choose one of the two randomly to increase performance, instead of
> reading every time from 3 and never from 4.
> 2. If the requested stripe is on devid 1 and 3, all is fine (as long
> as the queue depth of 1 isn't much larger than the queue depth of 3).
> 3. If the requested stripe is on devid 1 and 2, the algorithm should
> choose one of the two randomly to increase performance, instead of
> reading every time from 1 and never from 2.
>
> And in the future, all random picks of a device should be replaced by
> a heuristic algorithm that respects the queue depth and sequential
> reads.

  This scenario is handled very well by the pid/heuristic based
  read load balancer. The pid based read load balancer is still the
  default; Tim has written an IO load based read balancer, which can
  be set using this mount option once everything is integrated, and
  it needs experiments to see whether it can replace the pid method
  as the default.
  Further, as of now we don't do allocation grouping, so if you have
  two SSDs and two HDs in a RAID1, it is not guaranteed that an
  allocation will always span an SSD and an HD, so there is a bit of
  randomness in the allocation itself.

Thanks, Anand

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
  2018-01-31 13:42           ` Nikolay Borisov
@ 2018-01-31 14:36             ` Anand Jain
  2018-02-01  5:26               ` Edmund Nadolski
  0 siblings, 1 reply; 23+ messages in thread
From: Anand Jain @ 2018-01-31 14:36 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs



On 01/31/2018 09:42 PM, Nikolay Borisov wrote:


>>> So usually this should be functionality handled by the raid/san
>>> controller I guess, > but given that btrfs is playing the role of a
>>> controller here at what point are we drawing the line of not
>>> implementing block-level functionality into the filesystem ?
>>
>>   Don't worry this is not invading into the block layer. How
>>   can you even build this functionality in the block layer ?
>>   Block layer even won't know that disks are mirrored. RAID
>>   does or BTRFS in our case.
>>
> 
> By block layer I guess I meant the storage driver of a particular raid
> card. Because what is currently happening is re-implementing
> functionality that will generally sit in the driver. So my question was
> more generic and high-level - at what point do we draw the line of
> implementing features that are generally implemented in hardware devices
> (be it their drivers or firmware).

  Not all HW configs use RAID capable HBAs. A server connected to a SATA
  JBOD using a SATA HBA without MD will rely on BTRFS to provide all the
  features and capabilities that would otherwise have been provided by
  such a HW config.

::

>>>> ::
>>>>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>>>>> index 39ba59832f38..478623e6e074 100644
>>>>>> --- a/fs/btrfs/volumes.c
>>>>>> +++ b/fs/btrfs/volumes.c
>>>>>> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
>>>>>> btrfs_fs_info *fs_info,
>>>>>>             num = map->num_stripes;
>>>>>>           switch(fs_info->read_mirror_policy) {
>>>>>> +    case BTRFS_READ_MIRROR_BY_DEV:
>>>>>> +        optimal = first;
>>>>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>>>>> +                 &map->stripes[optimal].dev->dev_state))
>>>>>> +            break;
>>>>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>>>>> +                 &map->stripes[++optimal].dev->dev_state))
>>>>>> +            break;
>>>>>> +        optimal = first;
>>>>>
>>>>> you set optimal 2 times, the second one seems redundant.
>>>>
>>>>    No actually. When both the disks containing the stripe does not
>>>>    have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to
>>>>    use first found stripe.
>>>
>>> Yes, and the fact that you've already set optimal = first right after
>>> BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again set
>>> optimal right before the final break? What am I missing here?
>>
>>    Ah. I think you are missing ++optimal in the 2nd if.
> 
> You are right, but I'd prefer you index the stripes array with 'optimal'
> and 'optimal + 1' and leave just a single assignment

  Ok. Will improve that.

Thanks, Anand


>>
>> Thanks, Anand
>>
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/2] Policy to balance read across mirrored devices
  2018-01-31 14:26       ` Anand Jain
@ 2018-01-31 14:52         ` Peter Becker
  2018-01-31 16:11           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Becker @ 2018-01-31 14:52 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

This is all clear. My question refers to "use the lower devid disk
containing the stripe"

2018-01-31 10:01 GMT+01:00 Anand Jain <anand.jain@oracle.com>:
>  When a stripe is not present on the read optimized disk it will just
>  use the lower devid disk containing the stripe (instead of falling back
>  to the pid based random disk).

Using only one disk (the disk with the lowest devid that contains the
stripe) as the fallback is not a good option imho.
Instead, the pid should still be used as the fallback to distribute
the workload among all available drives.

[stripe to use] = [prefer stripes present on read_mirror_policy
devids] > [fallback to pid % stripe count]

Perhaps I'm not able to express myself in English, or did I misunderstand you?

2018-01-31 15:26 GMT+01:00 Anand Jain <anand.jain@oracle.com>:
>
>
> On 01/31/2018 06:47 PM, Peter Becker wrote:
>>
>> 2018-01-31 10:01 GMT+01:00 Anand Jain <anand.jain@oracle.com>:
>>>
>>>   When a stripe is not present on the read optimized disk it will just
>>>   use the lower devid disk containing the stripe (instead of failing back
>>>   to the pid based random disk).
>>
>>
>> Is this a good behavior? beause this would eliminate every performance
>> benefit of the pid base random disk pick if the requested stripe is
>> not present on the read optimized disk.
>> Wouldn't it be better to specify a fallback and use the pid base
>> random pick as default for the fallback.
>>
>> For example:
>>
>> RAID 1 over 4 disk's
>>
>> devid | rpm | size
>> ------------------------
>> 1 | 7200 rpm | 3 TB
>> 2 | 7200 rpm | 3 TB
>> 3 | 5400 rpm | 4 TB
>> 4 | 5400 rpm | 4 TB
>>
>> mount -o read_mirror_policy=1,read_mirror_policy=2
>>
>> Cases:
>> 1. if the requested stripe is on devid 3 and 4 the algorithm should
>> choise on of both randomly to incresse performance instead of read
>> everytime from 3 and never from 4
>> 2. if the requested stripe is on devid 1 and 3, all is fine ( in case
>> of the queue deep of 1 isn't mutch larger then the queue deep of 3 )
>> 3. if the requested stripe is on devid 1 and 2, the algorithm should
>> choise on of both randomly to incresse performance instead of read
>> everytime from 1 and never from 2
>
>>
>>
>> And all randomly picks of a device should be replaced by a heuristic
>> algorithm wo respect the queue deep and sequential reads in the
>> future.
>
>
>  This scenario is handled very well by the pid/heuristic based
>  read load balancer. The pid based read load balancer is still the
>  default; Tim has written an IO load based read balancer, which can
>  be set using this mount option once everything is integrated, and
>  it needs experiments to see whether it can replace the pid method
>  as the default.
>  Further, as of now we don't do allocation grouping, so if you have
>  two SSDs and two HDs in a RAID1, it is not guaranteed that an
>  allocation will always span an SSD and an HD, so there is a bit of
>  randomness in the allocation itself.
>
> Thanks, Anand

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/2] Policy to balance read across mirrored devices
  2018-01-31 14:52         ` Peter Becker
@ 2018-01-31 16:11           ` Austin S. Hemmelgarn
  2018-01-31 16:40             ` Peter Becker
  0 siblings, 1 reply; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2018-01-31 16:11 UTC (permalink / raw)
  To: Peter Becker, Anand Jain; +Cc: linux-btrfs

On 2018-01-31 09:52, Peter Becker wrote:
> This is all clear. My question refers to "use the lower devid disk
> containing the stripe"
> 
> 2018-01-31 10:01 GMT+01:00 Anand Jain <anand.jain@oracle.com>:
>>   When a stripe is not present on the read optimized disk it will just
>>   use the lower devid disk containing the stripe (instead of failing back
>>   to the pid based random disk).
> 
> Using only one disk (the disk with the lowest devid that contains the
> stripe) as the fallback is not a good option imho.
> Instead, the pid should still be used as the fallback to distribute
> the workload among all available drives.
> 
> [stripe to use] = [prefer stripes present on read_mirror_policy
> devids] > [fallback to pid % stripe count]
> 
> Perhaps I'm not able to express myself in English, or did I misunderstand you?
Unless I'm seriously misunderstanding the commit messages, the primary 
purpose of having this as an option at all is for testing.  The fact 
that it happens to allow semantics similar to MD's write-mostly flag 
when dealing with a 2-device raid1 profile is just a bonus.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/2] Policy to balance read across mirrored devices
  2018-01-31 16:11           ` Austin S. Hemmelgarn
@ 2018-01-31 16:40             ` Peter Becker
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Becker @ 2018-01-31 16:40 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Anand Jain, linux-btrfs

OK, I understood the commit message as implying that the behavior for
tests is more of a bonus.

2018-01-30 7:30 GMT+01:00 Anand Jain <anand.jain@oracle.com>:
> This __also__ helps testing

Then it's clear to me. It would just be good to have this behavior
documented, so that no one else, like me, tries to use it for
performance tuning; at least the feature with the devid.

Thanks Austin,
Thanks Anand

2018-01-31 17:11 GMT+01:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> On 2018-01-31 09:52, Peter Becker wrote:
>>
>> This is all clear. My question referes to "use the lower devid disk
>> containing the stripe"
>>
>> 2018-01-31 10:01 GMT+01:00 Anand Jain <anand.jain@oracle.com>:
>>>
>>>   When a stripe is not present on the read optimized disk it will just
>>>   use the lower devid disk containing the stripe (instead of failing back
>>>   to the pid based random disk).
>>
>>
>> Use only one disk (the disk with the lowest devid that containing the
>> stripe) as fallback should be not a good option imho.
>> Instead of it should still be used the pid as fallback to distribute
>> the workload among all available drives.
>>
>> [stripe to use] = [preffer stripes present on read_mirror_policy
>> devids] > [fallback to pid % stripe count]
>>
>> Perhaps I'm not be able to express myself in English or did I
>> misunderstand you?
>
> Unless I'm seriously misunderstanding the commit messages, the primary
> purpose of having this as an option at all is for testing.  The fact that it
> happens to allow semantics similar to MD's write-mostly flag when dealing
> with a 2-device raid1 profile is just a bonus.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
  2018-01-31 14:36             ` Anand Jain
@ 2018-02-01  5:26               ` Edmund Nadolski
  2018-02-01  8:12                 ` Anand Jain
  0 siblings, 1 reply; 23+ messages in thread
From: Edmund Nadolski @ 2018-02-01  5:26 UTC (permalink / raw)
  To: Anand Jain, Nikolay Borisov, linux-btrfs

On 1/31/18 7:36 AM, Anand Jain wrote:
> 
> 
> On 01/31/2018 09:42 PM, Nikolay Borisov wrote:
> 
> 
>>>> So usually this should be functionality handled by the raid/san
>>>> controller I guess, > but given that btrfs is playing the role of a
>>>> controller here at what point are we drawing the line of not
>>>> implementing block-level functionality into the filesystem ?
>>>
>>>   Don't worry this is not invading into the block layer. How
>>>   can you even build this functionality in the block layer ?
>>>   Block layer even won't know that disks are mirrored. RAID
>>>   does or BTRFS in our case.
>>>
>>
>> By block layer I guess I meant the storage driver of a particular raid
>> card. Because what is currently happening is re-implementing
>> functionality that will generally sit in the driver. So my question was
>> more generic and high-level - at what point do we draw the line of
>> implementing feature that are generally implemented in hardware devices
>> (be it their drivers or firmware).
> 
>  Not all HW configs use RAID capable HBAs. A server connected to a SATA
>  JBOD using a SATA HBA without MD will relay on BTRFS to provide all the
>  features and capabilities that otherwise would have provided by such a
>  presumable HW config.

That does sort of sound like implementing some portion of the
HBA features/capabilities in the filesystem.

To me it seems this could be workable at the fs level, provided it
deals just with policies and remains hardware-neutral.  However most
of the use cases appear to involve some hardware-dependent knowledge
or assumptions.  What happens when someone sets this on a virtual disk,
or say a (persistent) memory-backed block device?  Case #6 seems to
open up some potential for unexpected interactions (which may be hard
to reproduce, esp. in error/recovery scenarios).

Case #2 takes a devid, but I notice btrfs_device::devid says, "the
internal btrfs device id".  How does a user obtain that internal value
so it can be set as a mount option?

Thanks,
Ed


>>>>> ::
>>>>>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>>>>>> index 39ba59832f38..478623e6e074 100644
>>>>>>> --- a/fs/btrfs/volumes.c
>>>>>>> +++ b/fs/btrfs/volumes.c
>>>>>>> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
>>>>>>> btrfs_fs_info *fs_info,
>>>>>>>             num = map->num_stripes;
>>>>>>>           switch(fs_info->read_mirror_policy) {
>>>>>>> +    case BTRFS_READ_MIRROR_BY_DEV:
>>>>>>> +        optimal = first;
>>>>>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>>>>>> +                 &map->stripes[optimal].dev->dev_state))
>>>>>>> +            break;
>>>>>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>>>>>> +                 &map->stripes[++optimal].dev->dev_state))
>>>>>>> +            break;
>>>>>>> +        optimal = first;
>>>>>>
>>>>>> you set optimal 2 times, the second one seems redundant.
>>>>>
>>>>>    No actually. When both the disks containing the stripe does not
>>>>>    have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to
>>>>>    use first found stripe.
>>>>
>>>> Yes, and the fact that you've already set optimal = first right after
>>>> BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again
>>>> set
>>>> optimal right before the final break? What am I missing here?
>>>
>>>    Ah. I think you are missing ++optimal in the 2nd if.
>>
>> You are right, but I'd prefer you index the stripes array with 'optimal'
>> and 'optimal + 1' and leave just a single assignment
> 
>  Ok. Will improve that.
> 
> Thanks, Anand
> 
> 
>>>
>>> Thanks, Anand
>>>
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
  2018-02-01  5:26               ` Edmund Nadolski
@ 2018-02-01  8:12                 ` Anand Jain
  2018-02-01 23:46                   ` Edmund Nadolski
  0 siblings, 1 reply; 23+ messages in thread
From: Anand Jain @ 2018-02-01  8:12 UTC (permalink / raw)
  To: Edmund Nadolski, Nikolay Borisov, linux-btrfs



On 02/01/2018 01:26 PM, Edmund Nadolski wrote:
> On 1/31/18 7:36 AM, Anand Jain wrote:
>>
>>
>> On 01/31/2018 09:42 PM, Nikolay Borisov wrote:
>>
>>
>>>>> So usually this should be functionality handled by the raid/san
>>>>> controller I guess, > but given that btrfs is playing the role of a
>>>>> controller here at what point are we drawing the line of not
>>>>> implementing block-level functionality into the filesystem ?
>>>>
>>>>    Don't worry this is not invading into the block layer. How
>>>>    can you even build this functionality in the block layer ?
>>>>    Block layer even won't know that disks are mirrored. RAID
>>>>    does or BTRFS in our case.
>>>>
>>>
>>> By block layer I guess I meant the storage driver of a particular raid
>>> card. Because what is currently happening is re-implementing
>>> functionality that will generally sit in the driver. So my question was
>>> more generic and high-level - at what point do we draw the line of
>>> implementing feature that are generally implemented in hardware devices
>>> (be it their drivers or firmware).
>>
>>   Not all HW configs use RAID capable HBAs. A server connected to a SATA
>>   JBOD using a SATA HBA without MD will relay on BTRFS to provide all the
>>   features and capabilities that otherwise would have provided by such a
>>   presumable HW config.
> 
> That does sort of sound like implementing some portion of the
> HBA features/capabilities in the filesystem.
> 
> To me it seems this could be workable at the fs level, provided it
> deals just with policies and remains hardware-neutral.

  Thanks. Ok.

> However most
> of the use cases appear to involve some hardware-dependent knowledge
> or assumptions. 

> What happens when someone sets this on a virtual disk,
> or say a (persistent) memory-backed block device? 

  Do you have any particular policy in mind?

> Case #6 seems to
> open up some potential for unexpected interactions (which may be hard
> to reproduce, esp. in error/recovery scenarios).

  Yep. Even the #1 pid based policy (the current default) has such
  potential for unexpected interactions, which is what motivated me
  to provide the devid based policy.

> Case #2 takes a devid, but I notice btrfs_device::devid says, "the
> internal btrfs device id".  How does a user obtain that internal value
> so it can be set as a mount option?

   btrfs fi show -m
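For example (the output below is shortened and purely illustrative;
paths, sizes, and ids depend on the system, and read_mirror_policy is
the mount option proposed by this patchset):

```shell
# List mounted-filesystem devices together with their btrfs-internal devids:
btrfs filesystem show -m /mnt
#   Label: none  uuid: ...
#       devid 1 size 3.00TiB used 1.10TiB path /dev/sdb
#       devid 2 size 3.00TiB used 1.10TiB path /dev/sdc

# Then direct normal reads at devid 2:
mount -o read_mirror_policy=2 /dev/sdb /mnt
```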

Thanks, Anand

> Thanks,
> Ed
> 
> 
>>>>>> ::
>>>>>>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>>>>>>> index 39ba59832f38..478623e6e074 100644
>>>>>>>> --- a/fs/btrfs/volumes.c
>>>>>>>> +++ b/fs/btrfs/volumes.c
>>>>>>>> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
>>>>>>>> btrfs_fs_info *fs_info,
>>>>>>>>              num = map->num_stripes;
>>>>>>>>            switch(fs_info->read_mirror_policy) {
>>>>>>>> +    case BTRFS_READ_MIRROR_BY_DEV:
>>>>>>>> +        optimal = first;
>>>>>>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>>>>>>> +                 &map->stripes[optimal].dev->dev_state))
>>>>>>>> +            break;
>>>>>>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>>>>>>> +                 &map->stripes[++optimal].dev->dev_state))
>>>>>>>> +            break;
>>>>>>>> +        optimal = first;
>>>>>>>
>>>>>>> you set optimal 2 times, the second one seems redundant.
>>>>>>
>>>>>>     No actually. When both the disks containing the stripe does not
>>>>>>     have the BTRFS_DEV_STATE_READ_MIRROR, then I would just want to
>>>>>>     use first found stripe.
>>>>>
>>>>> Yes, and the fact that you've already set optimal = first right after
>>>>> BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again
>>>>> set
>>>>> optimal right before the final break? What am I missing here?
>>>>
>>>>     Ah. I think you are missing ++optimal in the 2nd if.
>>>
>>> You are right, but I'd prefer you index the stripes array with 'optimal'
>>> and 'optimal + 1' and leave just a single assignment
>>
>>   Ok. Will improve that.
>>
>> Thanks, Anand
>>
>>
>>>>
>>>> Thanks, Anand
>>>>
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
  2018-02-01  8:12                 ` Anand Jain
@ 2018-02-01 23:46                   ` Edmund Nadolski
  2018-02-02 12:36                     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 23+ messages in thread
From: Edmund Nadolski @ 2018-02-01 23:46 UTC (permalink / raw)
  To: Anand Jain, Nikolay Borisov, linux-btrfs



On 02/01/2018 01:12 AM, Anand Jain wrote:
> 
> 
> On 02/01/2018 01:26 PM, Edmund Nadolski wrote:
>> On 1/31/18 7:36 AM, Anand Jain wrote:
>>>
>>>
>>> On 01/31/2018 09:42 PM, Nikolay Borisov wrote:
>>>
>>>
>>>>>> So usually this should be functionality handled by the raid/san
>>>>>> controller I guess, but given that btrfs is playing the role of a
>>>>>> controller here, at what point are we drawing the line of not
>>>>>> implementing block-level functionality in the filesystem?
>>>>>
>>>>>    Don't worry, this is not invading the block layer. How
>>>>>    can you even build this functionality in the block layer?
>>>>>    The block layer won't even know that disks are mirrored; RAID
>>>>>    does, or BTRFS in our case.
>>>>>
>>>>
>>>> By block layer I guess I meant the storage driver of a particular raid
>>>> card. Because what is currently happening is re-implementing
>>>> functionality that will generally sit in the driver. So my question was
>>>> more generic and high-level - at what point do we draw the line of
>>>> implementing features that are generally implemented in hardware devices
>>>> (be it their drivers or firmware).
>>>
>>>   Not all HW configs use RAID-capable HBAs. A server connected to a SATA
>>>   JBOD using a SATA HBA without MD will rely on BTRFS to provide all the
>>>   features and capabilities that would otherwise have been provided by
>>>   such a presumed HW config.
>>
>> That does sort of sound like implementing some portion of the
>> HBA features/capabilities in the filesystem.
>>
>> To me it seems this could be workable at the fs level, provided it
>> deals just with policies and remains hardware-neutral.
> 
>  Thanks. Ok.
> 
>> However most
>> of the use cases appear to involve some hardware-dependent knowledge
>> or assumptions. 
> 
>> What happens when someone sets this on a virtual disk,
>> or say a (persistent) memory-backed block device? 
> 
>  Do you have any policy in particular ?

No, this is your proposal ;^)

You've said cases #3 thru #6 are illustrative only. However they make
assumptions about the underlying storage, and/or introduce potential for
unexpected behaviors. Plus they could end up replicating functionality
from other layers as Nikolay pointed out. Seems unlikely these would be
practical to implement.

Case #2 seems concerning if it exposes internal,
implementation-dependent filesystem data into a de facto user-level
interface. (Do we ensure the devid is unique, and cannot get changed or
re-assigned internally to a different device, etc?)

Thanks,
Ed


>> Case #6 seems to
>> open up some potential for unexpected interactions (which may be hard
>> to reproduce, esp. in error/recovery scenarios).
> 
>  Yep. Even the #1 pid-based policy (the current default) has this
>  problem, which motivated me to provide the devid-based policy.
> 
>> Case #2 takes a devid, but I notice btrfs_device::devid says, "the
>> internal btrfs device id".  How does a user obtain that internal value
>> so it can be set as a mount option?
> 
>   btrfs fi show -m
> 
> Thanks, Anand
> 
>> Thanks,
>> Ed
>>
>>
>>>>>>> ::
>>>>>>>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>>>>>>>> index 39ba59832f38..478623e6e074 100644
>>>>>>>>> --- a/fs/btrfs/volumes.c
>>>>>>>>> +++ b/fs/btrfs/volumes.c
>>>>>>>>> @@ -5270,6 +5270,16 @@ static int find_live_mirror(struct
>>>>>>>>> btrfs_fs_info *fs_info,
>>>>>>>>>              num = map->num_stripes;
>>>>>>>>>            switch(fs_info->read_mirror_policy) {
>>>>>>>>> +    case BTRFS_READ_MIRROR_BY_DEV:
>>>>>>>>> +        optimal = first;
>>>>>>>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>>>>>>>> +                 &map->stripes[optimal].dev->dev_state))
>>>>>>>>> +            break;
>>>>>>>>> +        if (test_bit(BTRFS_DEV_STATE_READ_MIRROR,
>>>>>>>>> +                 &map->stripes[++optimal].dev->dev_state))
>>>>>>>>> +            break;
>>>>>>>>> +        optimal = first;
>>>>>>>>
>>>>>>>> you set optimal 2 times, the second one seems redundant.
>>>>>>>
>>>>>>>     No actually. When both of the disks containing the stripe do not
>>>>>>>     have BTRFS_DEV_STATE_READ_MIRROR set, then I would just want to
>>>>>>>     use the first found stripe.
>>>>>>
>>>>>> Yes, and the fact that you've already set optimal = first right after
>>>>>> BTRFS_READ_MIRROR_BY_DEV ensures that, no ? Why do you need to again
>>>>>> set
>>>>>> optimal right before the final break? What am I missing here?
>>>>>
>>>>>     Ah. I think you are missing ++optimal in the 2nd if.
>>>>
>>>> You are right, but I'd prefer you index the stripes array with
>>>> 'optimal'
>>>> and 'optimal + 1' and leave just a single assignment
>>>
>>>   Ok. Will improve that.
>>>
>>> Thanks, Anand
>>>
>>>
>>>>>
>>>>> Thanks, Anand
>>>>>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
  2018-02-01 23:46                   ` Edmund Nadolski
@ 2018-02-02 12:36                     ` Austin S. Hemmelgarn
  2018-02-05  7:21                       ` Anand Jain
  0 siblings, 1 reply; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2018-02-02 12:36 UTC (permalink / raw)
  To: Edmund Nadolski, Anand Jain, Nikolay Borisov, linux-btrfs

On 2018-02-01 18:46, Edmund Nadolski wrote:
> 
> 
> On 02/01/2018 01:12 AM, Anand Jain wrote:
>>
>>
>> On 02/01/2018 01:26 PM, Edmund Nadolski wrote:
>>> On 1/31/18 7:36 AM, Anand Jain wrote:
>>>>
>>>>
>>>> On 01/31/2018 09:42 PM, Nikolay Borisov wrote:
>>>>
>>>>
>>>>>>> So usually this should be functionality handled by the raid/san
>>>>>>> controller I guess, but given that btrfs is playing the role of a
>>>>>>> controller here, at what point are we drawing the line of not
>>>>>>> implementing block-level functionality in the filesystem?
>>>>>>
>>>>>>     Don't worry, this is not invading the block layer. How
>>>>>>     can you even build this functionality in the block layer?
>>>>>>     The block layer won't even know that disks are mirrored; RAID
>>>>>>     does, or BTRFS in our case.
>>>>>>
>>>>>
>>>>> By block layer I guess I meant the storage driver of a particular raid
>>>>> card. Because what is currently happening is re-implementing
>>>>> functionality that will generally sit in the driver. So my question was
>>>>> more generic and high-level - at what point do we draw the line of
>>>>> implementing features that are generally implemented in hardware devices
>>>>> (be it their drivers or firmware).
>>>>
>>>>    Not all HW configs use RAID-capable HBAs. A server connected to a SATA
>>>>    JBOD using a SATA HBA without MD will rely on BTRFS to provide all the
>>>>    features and capabilities that would otherwise have been provided by
>>>>    such a presumed HW config.
>>>
>>> That does sort of sound like implementing some portion of the
>>> HBA features/capabilities in the filesystem.
>>>
>>> To me it seems this could be workable at the fs level, provided it
>>> deals just with policies and remains hardware-neutral.
>>
>>   Thanks. Ok.
>>
>>> However most
>>> of the use cases appear to involve some hardware-dependent knowledge
>>> or assumptions.
>>
>>> What happens when someone sets this on a virtual disk,
>>> or say a (persistent) memory-backed block device?
>>
>>   Do you have any policy in particular ?
> 
> No, this is your proposal ;^)
> 
> You've said cases #3 thru #6 are illustrative only. However they make
> assumptions about the underlying storage, and/or introduce potential for
> unexpected behaviors. Plus they could end up replicating functionality
> from other layers as Nikolay pointed out. Seems unlikely these would be
> practical to implement.
The I/O one would actually be rather nice to have and wouldn't really be 
duplicating anything (at least, not duplicating anything we consistently 
run on top of).  The pid-based selector works fine for cases where the 
only thing on the disks is a single BTRFS filesystem.  When there's more 
than that, it can very easily result in highly asymmetrical load on the 
disks because it doesn't account for current I/O load when picking a 
copy to read.  Last I checked, both MD and DM-RAID have at least the 
option to use I/O load in determining where to send reads for RAID1 
setups, and they do a far better job than BTRFS at balancing load in 
these cases.
> 
> Case #2 seems concerning if it exposes internal,
> implementation-dependent filesystem data into a de facto user-level
> interface. (Do we ensure the devid is unique, and cannot get changed or
> re-assigned internally to a different device, etc?)
The devid gets assigned when a device is added to a filesystem, it's a 
monotonically increasing number that gets incremented for every new 
device, and never changes for a given device as long as it remains in 
the filesystem (it will change if you remove the device and then re-add 
it).  The only exception to this is that the replace command will assign 
the new device the same devid that the device it is replacing had (which 
I would argue leads to consistent behavior here).  Given that, I think 
it's sufficiently safe to use it for something like this.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] btrfs: add read_mirror_policy parameter devid
  2018-02-02 12:36                     ` Austin S. Hemmelgarn
@ 2018-02-05  7:21                       ` Anand Jain
  0 siblings, 0 replies; 23+ messages in thread
From: Anand Jain @ 2018-02-05  7:21 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Edmund Nadolski, Nikolay Borisov, linux-btrfs



>>>>>>>> So usually this should be functionality handled by the raid/san
>>>>>>>> controller I guess, but given that btrfs is playing the role of a
>>>>>>>> controller here, at what point are we drawing the line of not
>>>>>>>> implementing block-level functionality in the filesystem?
>>>>>>>
>>>>>>>     Don't worry, this is not invading the block layer. How
>>>>>>>     can you even build this functionality in the block layer?
>>>>>>>     The block layer won't even know that disks are mirrored; RAID
>>>>>>>     does, or BTRFS in our case.
>>>>>>>
>>>>>>
>>>>>> By block layer I guess I meant the storage driver of a particular 
>>>>>> raid card. Because what is currently happening is re-implementing
>>>>>> functionality that will generally sit in the driver. So my 
>>>>>> question was more generic and high-level - at what point do we draw the line of
>>>>>> implementing features that are generally implemented in hardware
>>>>>> devices (be it their drivers or firmware).
>>>>>
>>>>>    Not all HW configs use RAID-capable HBAs. A server connected to
>>>>>    a SATA JBOD using a SATA HBA without MD will rely on BTRFS to
>>>>>    provide all the features and capabilities that would otherwise
>>>>>    have been provided by such a presumed HW config.
>>>>
>>>> That does sort of sound like implementing some portion of the
>>>> HBA features/capabilities in the filesystem.
>>>>
>>>> To me it seems this could be workable at the fs level, provided it
>>>> deals just with policies and remains hardware-neutral.
>>>

>>>   Thanks. Ok.
>>>
>>>> However most
>>>> of the use cases appear to involve some hardware-dependent knowledge
>>>> or assumptions.
>>>
>>>> What happens when someone sets this on a virtual disk,
>>>> or say a (persistent) memory-backed block device?
>>>
>>>   Do you have any policy in particular ?
>>
>> No, this is your proposal ;^)


  Policy added here:
    devid
  It is about the devid, which is assigned by btrfs.
  Future policies:
     LBA/ssd/io/Tim's heuristic
  They aren't hardware dependent, though ssd means: use the ssd
  for reading if one is available. LBA divides the read IO
  based on the sector #. The logic is quite simple:
       read-sector < FS-SIZE/2 ? mirror1 : mirror2;

>> You've said cases #3 thru #6 are illustrative only. However they make
>> assumptions about the underlying storage, and/or introduce potential for
>> unexpected behaviors.

  The assumption I am making is that users will understand their
  storage and tune this parameter accordingly, and there is a heuristic
  (which Tim wrote) to do things automatically. Sometimes manual settings
  provide better performance than a heuristic.

>> Plus they could end up replicating functionality
>> from other layers as Nikolay pointed out. Seems unlikely these would be
>> practical to implement.

> The I/O one would actually be rather nice to have and wouldn't really be 
> duplicating anything (at least, not duplicating anything we consistently 
> run on top of).  The pid-based selector works fine for cases where the 
> only thing on the disks is a single BTRFS filesystem.  When there's more 
> than that, it can very easily result in highly asymmetrical load on the 
> disks because it doesn't account for current I/O load when picking a 
> copy to read.  Last I checked, both MD and DM-RAID have at least the 
> option to use I/O load in determining where to send reads for RAID1 
> setups, and they do a far better job than BTRFS at balancing load in 
> these cases.

  Yeah.. some enterprise filesystems and storage arrays communicate
  performance tuning with each other automatically. We will get there too.

>> Case #2 seems concerning if it exposes internal,
>> implementation-dependent filesystem data into a de facto user-level
>> interface. (Do we ensure the devid is unique, and cannot get changed or
>> re-assigned internally to a different device, etc?)
> The devid gets assigned when a device is added to a filesystem, it's a 
> monotonically increasing number that gets incremented for every new 
> device, and never changes for a given device as long as it remains in 
> the filesystem (it will change if you remove the device and then re-add 
> it).  The only exception to this is that the replace command will assign 
> the new device the same devid that the device it is replacing had (which 
> I would argue leads to consistent behavior here).  Given that, I think 
> it's sufficiently safe to use it for something like this.


Thanks, Anand



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-02-05  7:20 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-30  6:30 [PATCH 0/2] Policy to balance read across mirrored devices Anand Jain
2018-01-30  6:30 ` [PATCH 1/2] btrfs: add mount option read_mirror_policy Anand Jain
2018-01-31  8:06   ` Nikolay Borisov
2018-01-31  9:06     ` Anand Jain
2018-01-30  6:30 ` [PATCH 2/2] btrfs: add read_mirror_policy parameter devid Anand Jain
2018-01-31  8:38   ` Nikolay Borisov
2018-01-31  9:28     ` Anand Jain
2018-01-31  9:54       ` Nikolay Borisov
2018-01-31 13:38         ` Anand Jain
2018-01-31 13:42           ` Nikolay Borisov
2018-01-31 14:36             ` Anand Jain
2018-02-01  5:26               ` Edmund Nadolski
2018-02-01  8:12                 ` Anand Jain
2018-02-01 23:46                   ` Edmund Nadolski
2018-02-02 12:36                     ` Austin S. Hemmelgarn
2018-02-05  7:21                       ` Anand Jain
2018-01-31  7:51 ` [PATCH 0/2] Policy to balance read across mirrored devices Peter Becker
2018-01-31  9:01   ` Anand Jain
2018-01-31 10:47     ` Peter Becker
2018-01-31 14:26       ` Anand Jain
2018-01-31 14:52         ` Peter Becker
2018-01-31 16:11           ` Austin S. Hemmelgarn
2018-01-31 16:40             ` Peter Becker
