linux-btrfs.vger.kernel.org archive mirror
* [PATCH V7] Btrfs: enhance raid1/10 balance heuristic
@ 2018-11-12 11:58 Timofey Titovets
  2018-11-13 18:55 ` Goffredo Baroncelli
  2018-11-14  1:27 ` Anand Jain
  0 siblings, 2 replies; 5+ messages in thread
From: Timofey Titovets @ 2018-11-12 11:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: nborisov, Timofey Titovets

From: Timofey Titovets <nefelim4ag@gmail.com>

Currently the btrfs raid1/10 balancer distributes read requests across
mirrors based on pid % num_mirrors.

Make the logic aware of:
 - whether the underlying devices are rotational or non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - If one of the mirrors is non-rotational, repick the optimal to it
 - If the other mirror has a shorter queue length than the optimal one,
   repick to that mirror

To avoid round-robin request balancing,
round the queue length down (illustrated below):
 - By 8 for rotational devs
 - By 2 for non-rotational devs
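To illustrate the damping effect, here is a standalone userspace sketch
(not part of the patch; ALIGN_DOWN below is a simplified stand-in for
the kernel macro, valid for power-of-two factors):

#include <stdio.h>

#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

int main(void)
{
	unsigned int qlen;

	/*
	 * With factor 8, raw lengths 0..7 compare equal (-> 0) and
	 * 8..15 compare equal (-> 8): small noise cannot flip the
	 * chosen mirror, only sustained load differences can.
	 */
	for (qlen = 0; qlen <= 16; qlen++)
		printf("raw qlen %2u -> rounded %2u\n",
		       qlen, ALIGN_DOWN(qlen, 8));
	return 0;
}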

Some benchmark results from the mailing list
(Dmitrii Tcvetkov <demfloro@demfloro.ru>):
Benchmark summary (arithmetic mean of 3 runs):
       | Mainline   | Patch
-------+------------+-----------
RAID1  | 18.9 MiB/s | 26.5 MiB/s
RAID10 | 30.7 MiB/s | 30.7 MiB/s
------------------------------------------------------------------------
mainline, fio happened to read from the first HDD (quite slow):
Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS]
  read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec)
  lat (msec): min=2, max=825, avg=60.17, stdev=65.06
------------------------------------------------------------------------
mainline, fio happened to read from the second HDD (much newer):
Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS]
  read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec)
  lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56
------------------------------------------------------------------------
mainline, fio happened to read from an SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS]
  read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec)
  lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36
------------------------------------------------------------------------
With the patch, 2 HDDs:
Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS]
  read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec)
  lat (usec): min=435, max=341037, avg=28511.64, stdev=30000.14
------------------------------------------------------------------------
With the patch, HDD (the old one) + SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS]
  read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec)
  lat  (usec): min=363, max=346752, avg=1381.73, stdev=6948.32

Changes:
  v1 -> v2:
    - Use helper part_in_flight() from genhd.c
      to get queue length
    - Move guess code to guess_optimal()
    - Change balancer logic: use pid % num_mirrors by default,
      and only rebalance across spinning rust if one of the
      underlying devices is overloaded
  v2 -> v3:
    - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
  v3 -> v4:
    - Rebased on latest misc-next
  v4 -> v5:
    - Rebased on latest misc-next
  v5 -> v6:
    - Fix spelling
    - Include bench results
  v6 -> v7:
    - Fixes based on Nikolay Borisov review:
      * Assume num == 2
      * Remove "for" loop based on that assumption, where possible
      * No functional changes

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Tested-by: Dmitrii Tcvetkov <demfloro@demfloro.ru>
Reviewed-by: Dmitrii Tcvetkov <demfloro@demfloro.ru>
---
 block/genhd.c      |   1 +
 fs/btrfs/volumes.c | 100 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index be5bab20b2ab..939f0c6a2d79 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
 				atomic_read(&part->in_flight[1]);
 	}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 		       unsigned int inflight[2])
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f4405e430da6..a6632cc2bfab 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -13,6 +13,7 @@
 #include <linux/raid/pq.h>
 #include <linux/semaphore.h>
 #include <linux/uuid.h>
+#include <linux/genhd.h>
 #include <linux/list_sort.h>
 #include "ctree.h"
 #include "extent_map.h"
@@ -5159,6 +5160,102 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded-down in-flight queue length of a bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor; large for HDD, small for SSD (e.g. 8 and 2)
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+	int qlen;
+	struct hd_struct *bd_part = bdev->bd_part;
+	struct request_queue *rq = bdev_get_queue(bdev);
+	uint32_t inflight[2] = {0, 0};
+
+	part_in_flight(rq, bd_part, inflight);
+
+	/* Take the busier direction (reads vs. writes) as the load */
+	qlen = max_t(uint32_t, inflight[0], inflight[1]);
+	/*
+	 * Round down so that small fluctuations in queue
+	 * length do not flip the mirror on every request
+	 */
+	return ALIGN_DOWN(qlen, round_down);
+}
+
+/**
+ * guess_optimal - return the guessed optimal mirror
+ *
+ * The optimal mirror is expected to be pid % num_stripes.
+ *
+ * That is generally fine for spreading load;
+ * add some balancing based on per-device queue length on top of it.
+ *
+ * Basic ideas:
+ *  - Sequential reads generate few requests, so if the drives
+ *    are equally loaded, keep the pid % num_stripes balancing
+ *  - For mixed rotational/non-rotational mirrors, pick the non-rotational
+ *    one as optimal, and repick only if the other queue is much shorter
+ *  - Repick the optimal mirror if the other mirror's queue is shorter
+ */
+static int guess_optimal(struct map_lookup *map, int num, int optimal)
+{
+	int i;
+	int round_down = 8;
+	/* Init for missing bdevs */
+	int qlen[2] = { INT_MAX, INT_MAX };
+	bool is_nonrot[2] = { false, false };
+	bool all_bdev_nonrot = true;
+	bool all_bdev_rotate = true;
+	struct block_device *bdev;
+
+	ASSERT(num == 2);
+
+	/* Check accessible bdevs */
+	for (i = 0; i < 2; i++) {
+		bdev = map->stripes[i].dev->bdev;
+		if (bdev) {
+			qlen[i] = 0;
+			is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
+			if (is_nonrot[i])
+				all_bdev_rotate = false;
+			else
+				all_bdev_nonrot = false;
+		}
+	}
+
+	/*
+	 * Don't bother with the computation
+	 * if only one of the two bdevs is accessible
+	 */
+	if (qlen[0] == INT_MAX)
+		return 1;
+	if (qlen[1] == INT_MAX)
+		return 0;
+
+	if (all_bdev_nonrot)
+		round_down = 2;
+
+	for (i = 0; i < 2; i++) {
+		bdev = map->stripes[i].dev->bdev;
+		qlen[i] = bdev_get_queue_len(bdev, round_down);
+	}
+
+	/* Mixed case (both flags cleared above): pick the non-rotational dev */
+	if (all_bdev_rotate == all_bdev_nonrot) {
+		if (is_nonrot[0])
+			optimal = 0;
+		else
+			optimal = 1;
+	}
+
+	if (qlen[optimal] > qlen[(optimal + 1) % 2])
+		optimal = (optimal + 1) % 2;
+
+	return optimal;
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct map_lookup *map, int first,
 			    int dev_replace_is_ongoing)
@@ -5177,7 +5274,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	else
 		num_stripes = map->num_stripes;
 
-	preferred_mirror = first + current->pid % num_stripes;
+	preferred_mirror = first + guess_optimal(map, num_stripes,
+						 current->pid % num_stripes);
 
 	if (dev_replace_is_ongoing &&
 	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
-- 
2.19.1


* Re: [PATCH V7] Btrfs: enhance raid1/10 balance heuristic
  2018-11-12 11:58 [PATCH V7] Btrfs: enhance raid1/10 balance heuristic Timofey Titovets
@ 2018-11-13 18:55 ` Goffredo Baroncelli
  2018-11-14  1:27 ` Anand Jain
  1 sibling, 0 replies; 5+ messages in thread
From: Goffredo Baroncelli @ 2018-11-13 18:55 UTC (permalink / raw)
  To: Timofey Titovets, linux-btrfs; +Cc: nborisov, Timofey Titovets

On 12/11/2018 12.58, Timofey Titovets wrote:
> From: Timofey Titovets <nefelim4ag@gmail.com>
> 
> Currently the btrfs raid1/10 balancer distributes read requests across
> mirrors based on pid % num_mirrors.
[...]
>   v6 -> v7:
>     - Fixes based on Nikolay Borisov review:
>       * Assume num == 2
>       * Remove "for" loop based on that assumption, where possible
>       * No functional changes
> 
> Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
> Tested-by: Dmitrii Tcvetkov <demfloro@demfloro.ru>
> Reviewed-by: Dmitrii Tcvetkov <demfloro@demfloro.ru>
> ---
[...]
> +/**
> + * guess_optimal - return the guessed optimal mirror
> + *
> + * The optimal mirror is expected to be pid % num_stripes.
> + *
> + * That is generally fine for spreading load;
> + * add some balancing based on per-device queue length on top of it.
> + *
> + * Basic ideas:
> + *  - Sequential reads generate few requests, so if the drives
> + *    are equally loaded, keep the pid % num_stripes balancing
> + *  - For mixed rotational/non-rotational mirrors, pick the non-rotational
> + *    one as optimal, and repick only if the other queue is much shorter
> + *  - Repick the optimal mirror if the other mirror's queue is shorter
> + */
> +static int guess_optimal(struct map_lookup *map, int num, int optimal)
> +{
> +	int i;
> +	int round_down = 8;
> +	/* Init for missing bdevs */
> +	int qlen[2] = { INT_MAX, INT_MAX };
> +	bool is_nonrot[2] = { false, false };
> +	bool all_bdev_nonrot = true;
> +	bool all_bdev_rotate = true;
> +	struct block_device *bdev;
> +
> +	ASSERT(num == 2);
> +
> +	/* Check accessible bdevs */
> +	for (i = 0; i < 2; i++) {

From your function comment it is not clear why you are comparing "num" to "2". Note that there are patches floating around which implement raid profiles with higher redundancy (IIRC up to 4 copies). I suggest stating this assumption in the comment too ("...this function works for up to 2 mirrors...") and, better, adding a define like

#define BTRFS_MAX_RAID1_RAID10_MIRRORS 2

And replace "2" with BTRFS_MAX_RAID1_RAID10_MIRRORS



> +		bdev = map->stripes[i].dev->bdev;
> +		if (bdev) {
> +			qlen[i] = 0;
> +			is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
> +			if (is_nonrot[i])
> +				all_bdev_rotate = false;
> +			else
> +				all_bdev_nonrot = false;
> +		}
> +	}
> +
> +	/*
> +	 * Don't bother with the computation
> +	 * if only one of the two bdevs is accessible
> +	 */
> +	if (qlen[0] == INT_MAX)
> +		return 1;
> +	if (qlen[1] == INT_MAX)
> +		return 0;
> +
> +	if (all_bdev_nonrot)
> +		round_down = 2;
> +
> +	for (i = 0; i < 2; i++) {
> +		bdev = map->stripes[i].dev->bdev;
> +		qlen[i] = bdev_get_queue_len(bdev, round_down);
> +	}
> +
> +	/* Mixed case (both flags cleared above): pick the non-rotational dev */
> +	if (all_bdev_rotate == all_bdev_nonrot) {
> +		if (is_nonrot[0])
> +			optimal = 0;
> +		else
> +			optimal = 1;
> +	}
> +
> +	if (qlen[optimal] > qlen[(optimal + 1) % 2])
> +		optimal = (optimal + 1) % 2;
> +
> +	return optimal;
> +}
> +
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
>  			    struct map_lookup *map, int first,
>  			    int dev_replace_is_ongoing)
> @@ -5177,7 +5274,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>  	else
>  		num_stripes = map->num_stripes;
>  
> -	preferred_mirror = first + current->pid % num_stripes;
> +	preferred_mirror = first + guess_optimal(map, num_stripes,
> +						 current->pid % num_stripes);
>  
>  	if (dev_replace_is_ongoing &&
>  	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [PATCH V7] Btrfs: enhance raid1/10 balance heuristic
  2018-11-12 11:58 [PATCH V7] Btrfs: enhance raid1/10 balance heuristic Timofey Titovets
  2018-11-13 18:55 ` Goffredo Baroncelli
@ 2018-11-14  1:27 ` Anand Jain
  2019-01-01 18:39   ` Timofey Titovets
  1 sibling, 1 reply; 5+ messages in thread
From: Anand Jain @ 2018-11-14  1:27 UTC (permalink / raw)
  To: Timofey Titovets, linux-btrfs; +Cc: nborisov, Timofey Titovets, David Sterba



I am OK with the least-used-path approach here for the IO routing;
that's probably most reasonable in generic configurations. It could
be the default read mirror policy as well.

But as I mentioned, not all configurations would benefit from the
heuristic approach here. For example: to make use of the SAN storage
cache and get high IO throughput, reads must be routed based on the
LBA, and this heuristic would make matters worse. There are plans to
add more read_mirror_policy options [1].

[1]
https://patchwork.kernel.org/patch/10403299/

I would rather provide configuration tunables for the use cases
than fix the behaviour with a heuristic. Heuristics are good only
for the known set of IO patterns they were designed for.

This is not the first time you have assumed a heuristic would provide
the best possible performance in all use cases. As I mentioned
for the compression heuristic, there was no problem statement that
you wanted to address using a heuristic; theoretically the integrated
compression heuristic would have to do a lot more computation when
all the file-extents are compressible, and it's not obvious to me
how the compression heuristic would help on a desktop machine where
most of the files are compressible.

IMO heuristics are good only for certain types of workload. Giving
an option to move away from them for manual tuning is desirable.

Thanks, Anand


On 11/12/2018 07:58 PM, Timofey Titovets wrote:
> From: Timofey Titovets <nefelim4ag@gmail.com>
> 
> Currently the btrfs raid1/10 balancer distributes read requests across
> mirrors based on pid % num_mirrors.
[...]


* Re: [PATCH V7] Btrfs: enhance raid1/10 balance heuristic
  2018-11-14  1:27 ` Anand Jain
@ 2019-01-01 18:39   ` Timofey Titovets
  2019-01-03  9:31     ` Anand Jain
  0 siblings, 1 reply; 5+ messages in thread
From: Timofey Titovets @ 2019-01-01 18:39 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, Nikolay Borisov, David Sterba

Oh, I just forgot to answer.

On Wed, 14 Nov 2018 at 04:27, Anand Jain <anand.jain@oracle.com> wrote:
>
> I am OK with the least-used-path approach here for the IO routing;
> that's probably most reasonable in generic configurations. It could
> be the default read mirror policy as well.
(thanks, that's pleasant to hear %) )

> But as I mentioned, not all configurations would benefit from the
> heuristic approach here. For example: to make use of the SAN storage
> cache and get high IO throughput, reads must be routed based on the
> LBA, and this heuristic would make matters worse. There are plans to
> add more read_mirror_policy options [1].
>
> [1]
> https://patchwork.kernel.org/patch/10403299/

Can you please give an example of a SAN stack where that
would make things 'worse'?

Moreover, pid-based balancing will also not play well in your example.

In a SAN stack the client always sees one device with N paths,
so no raid1 balancing can happen.
Maybe I haven't seen every setup in the world, but building raid1
from 2 remote devices sounds very bad to me. Even drbd
only provides one logical device to the end user.

> I would rather provide configuration tunables for the use cases
> than fix the behaviour with a heuristic. Heuristics are good only
> for the known set of IO patterns they were designed for.

Yep, but how complex do those tunables have to be?
i.e. we'll _always_ have corner cases with bad behaviour.

(Also I'd prefer sysfs tunables for that, instead of adding another
mount option.)

> This is not the first time you have assumed a heuristic would provide
> the best possible performance in all use cases. As I mentioned
> for the compression heuristic, there was no problem statement that
> you wanted to address using a heuristic; theoretically the integrated
> compression heuristic would have to do a lot more computation when
> all the file-extents are compressible, and it's not obvious to me
> how the compression heuristic would help on a desktop machine where
> most of the files are compressible.

Different tools exist because we have different use cases.
If something adds more problems than it solves, it must be changed or
just purged.

Moreover, the claim that most files on every desktop machine are compressible
is not true. I don't want to start a long discussion about a "spherical cow"
in a vacuum. Just my example:
➜  ~ sudo compsize /mnt
Processed 1198830 files, 1404382 regular extents (1481132 refs), 675870 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       77%      240G         308G         285G
none       100%      202G         202G         176G
zlib        37%      1.8G         5.1G         5.6G
lzo         61%       64M         104M         126M
zstd        35%       36G         100G         103G

That's system + home.
Home has different kinds of junk in it: videos, photos, source repos,
steam games, docker images,
DE app DBs and other random things. Some data has NOCOW because I
just lack the "mental strength"
to finish fixing the bad autodefrag behaviour with compressed data.
As you can see, most of the data volume is not compressed.

> IMO heuristics are good only for certain types of workload. Giving
> an option to move away from them for manual tuning is desirable.
>
> Thanks, Anand

Anyway, maybe you're right about the demand for adding some control over
internal behaviour,
and we can combine our work to properly support that.
(I don't like over-engineering, and just try to avoid the situation where
users start flipping random flags to make things better.)
But before that we need some feedback from upstream, bad or good.

Because currently the core btrfs devs work at companies which use btrfs
internally and/or
sell it to customers (SUSE?). I'm afraid the devs are afraid to change
internal behaviour without 100% confidence that it will be better.

Thanks!

> On 11/12/2018 07:58 PM, Timofey Titovets wrote:
> [...]


* Re: [PATCH V7] Btrfs: enhance raid1/10 balance heuristic
  2019-01-01 18:39   ` Timofey Titovets
@ 2019-01-03  9:31     ` Anand Jain
  0 siblings, 0 replies; 5+ messages in thread
From: Anand Jain @ 2019-01-03  9:31 UTC (permalink / raw)
  To: Timofey Titovets; +Cc: linux-btrfs, Nikolay Borisov, David Sterba



On 01/02/2019 02:39 AM, Timofey Titovets wrote:
> Oh, I just forgot to answer.
> 
> On Wed, 14 Nov 2018 at 04:27, Anand Jain <anand.jain@oracle.com> wrote:
>>
>> I am OK with the least-used-path approach here for the IO routing;
>> that's probably most reasonable in generic configurations. It could
>> be the default read mirror policy as well.
> (thanks, that's pleasant to hear %) )
> 
>> But as I mentioned, not all configurations would benefit from the
>> heuristic approach here. For example: to make use of the SAN storage
>> cache and get high IO throughput, reads must be routed based on the
>> LBA, and this heuristic would make matters worse. There are plans to
>> add more read_mirror_policy options [1].
>>
>> [1]
>> https://patchwork.kernel.org/patch/10403299/
> 
> Can you please give an example of a SAN stack where that
> would make things 'worse'?

As we are discussing the read mirror policy, consider only read 
IO. Now, if you access an LBA on a disk, that LBA gets cached on the 
storage target controller and in the disk cache. If the host then 
tries to access the same (mirrored) LBA on another disk, the mirror 
data also gets cached, wasting cache space; instead, the cache on the 
disk/target could have been used to cache new data, by routing the 
read based on the logical LBA so that the disk and storage-target 
caches complement each other. More below..

> Moreover, pid-based balancing will also not play well in your example.

I don't like the PID-based approach as it is less deterministic about 
where the read IO goes, so it's hard to debug (the read_mirror_policy 
patch in the ML will fix that). In this context (that is, disk/storage 
cache optimization) the current PID-based approach is better, as the 
read IO will consistently go to one disk for the life span of the 
application. That means the disk/storage-target cache is guaranteed 
not to be wasted as long as the application PID remains the same, 
albeit queue load is not taken care of in the PID approach. Whereas 
the least-used-path approach proposed here only optimizes the buffer 
queue on the HBA and the target, and falls short on target/disk cache 
optimization.
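
For contrast, a minimal sketch of what LBA-based routing could look
like (pick_mirror_by_lba is a hypothetical kernel-style helper for
illustration only, not code from the read_mirror_policy patch; the
1 MiB slice granularity is an assumption):

/*
 * Route reads by logical address so each mirror serves only its own
 * slice of the address space, keeping the disk/storage-target caches
 * complementary instead of duplicated across mirrors.
 */
static int pick_mirror_by_lba(u64 logical, int num_mirrors)
{
	return (int)((logical >> 20) % num_mirrors);
}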

> In a SAN stack the client always sees one device with N paths,
> so no raid1 balancing can happen.
> Maybe I haven't seen every setup in the world, but building raid1
> from 2 remote devices sounds very bad to me.
> Even drbd
> only provides one logical device to the end user.

BTRFS is yet to support N paths to a LUN/disk; moreover, N paths don't 
represent a data mirror, they are just N paths to one copy of the data, 
and the discussion here is not about multi-path. It's about the data 
mirror usage policy for read IO.

>> I would rather provide configuration tunables for the use cases
>> than fix the behaviour with a heuristic. Heuristics are good only
>> for the known set of IO patterns they were designed for.
> 
> Yep, but how complex do those tunables have to be?
> i.e. we'll _always_ have corner cases with bad behaviour.

mount -o read_mirror_policy=least_used or
mount -o read_mirror_policy=by_q_depth

> (Also I'd prefer sysfs tunables for that, instead of adding another
> mount option.)

There is a lot of work pending w.r.t. disk/volume and sysfs. I won't 
disturb it without a broader discussion of the major requisites from the 
"S" part of the RAS features. We still don't have the "S" part completely 
addressed. I wrote about it a long time back; I wish there were more 
common interest in commenting on it.

We should discuss that - btrfs sysfs is less mature than the mount 
options, so comments on the read_mirror_policy patch would be better. I am 
OK if it has to be sysfs.

>> This is not the first time you have assumed a heuristic would provide
>> the best possible performance in all use cases. As I mentioned
>> for the compression heuristic, there was no problem statement that
>> you wanted to address using a heuristic; theoretically the integrated
>> compression heuristic would have to do a lot more computation when
>> all the file-extents are compressible, and it's not obvious to me
>> how the compression heuristic would help on a desktop machine where
>> most of the files are compressible.
> 
> Different tools exist because we have different use cases.
> If something adds more problems than it solves, it must be changed or
> just purged.
> 
> Moreover, the claim that most files on every desktop machine are compressible
> is not true. I don't want to start a long discussion about a "spherical cow"
> in a vacuum. Just my example:
> ➜  ~ sudo compsize /mnt
> Processed 1198830 files, 1404382 regular extents (1481132 refs), 675870 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL       77%      240G         308G         285G
> none       100%      202G         202G         176G
> zlib        37%      1.8G         5.1G         5.6G
> lzo         61%       64M         104M         126M
> zstd        35%       36G         100G         103G

This does not address the question above. That is - when all the 
files/extents are compressible by the compression engine, the heuristic 
needlessly makes things more CPU intensive than without it, and 
without achieving any benefit. And moreover, the original problem of -o 
compress giving up too easily (and forever) while -o force-compress 
never giving up is still not addressed - which a heuristic could have fixed.

Thanks, Anand

> That's system + home.
> Home has different kinds of junk in it: videos, photos, source repos,
> steam games, docker images,
> DE app DBs and other random things. Some data has NOCOW because I
> just lack the "mental strength"
> to finish fixing the bad autodefrag behaviour with compressed data.
> As you can see, most of the data volume is not compressed.
> 
>> IMO heuristics are good only for certain types of workload. Giving
>> an option to move away from them for manual tuning is desirable.
>>
>> Thanks, Anand
> 
> Anyway, maybe you're right about the demand for adding some control over
> internal behaviour,
> and we can combine our work to properly support that.
> (I don't like over-engineering, and just try to avoid the situation where
> users start flipping random flags to make things better.)
> But before that we need some feedback from upstream, bad or good.
> 
> Because currently the core btrfs devs work at companies which use btrfs
> internally and/or
> sell it to customers (SUSE?). I'm afraid the devs are afraid to change
> internal behaviour without 100% confidence that it will be better.
> 
> Thanks!
> 
>> On 11/12/2018 07:58 PM, Timofey Titovets wrote:
>> [...]

end of thread

Thread overview: 5+ messages
-- links below jump to the message on this page --
2018-11-12 11:58 [PATCH V7] Btrfs: enhance raid1/10 balance heuristic Timofey Titovets
2018-11-13 18:55 ` Goffredo Baroncelli
2018-11-14  1:27 ` Anand Jain
2019-01-01 18:39   ` Timofey Titovets
2019-01-03  9:31     ` Anand Jain
